Skip to content

feat: datamodule#2

Open
xrusnack wants to merge 66 commits into
masterfrom
feature/data/datamodule
Open

feat: datamodule#2
xrusnack wants to merge 66 commits into
masterfrom
feature/data/datamodule

Conversation

@xrusnack
Copy link
Copy Markdown
Member

@xrusnack xrusnack commented Mar 17, 2026

Summary by CodeRabbit

  • New Features
    • Enhanced data loading with weak supervision, configurable train/val/test splits, and predict mode.
    • Automatic weighted sampling to address class imbalance during training.
    • New filtering options to enforce nuclei and positive-count constraints.
    • Batch collation utilities for efficient batching and prediction.
    • Extended training and ML configuration fields for dataset, sampling, and supervision.
  • Style
    • Added typing/schema declarations for dataset, batch, and prediction structures.

@xrusnack xrusnack requested review from matejpekar and vejtek March 17, 2026 00:00
@xrusnack xrusnack self-assigned this Mar 17, 2026
@xrusnack xrusnack requested a review from a team March 17, 2026 00:00
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data handling capabilities of the nuclei_graph project by introducing a dedicated DataModule. This module centralizes the complex logic of data loading, splitting into train/validation/test sets, and applying various supervision strategies for nuclei point cloud datasets. The changes aim to provide a robust, flexible, and reproducible data pipeline, allowing for easier experimentation with different models and training configurations. It also includes mechanisms for handling class imbalance and ensuring data integrity through filtering and structured type definitions.

Highlights

  • New DataModule Implementation: Introduced a new DataModule class to centralize data loading, splitting, and supervision for nuclei point cloud datasets, enhancing data management and reproducibility.
  • Weighted Random Sampling: Implemented an AutoWeightedRandomSampler to address class imbalance during training by dynamically assigning sample weights based on slide positivity scores.
  • Flexible Configuration System: Added new YAML configuration files for experiment training (base, crop-level, nuclei-level) and updated the main ml.yaml to integrate these, providing a more modular and extensible setup.
  • Data Utility Functions: Provided essential utility functions for data collation (collate_fn, collate_fn_predict) and filtering (min_count_filter), streamlining data preparation for PyTorch DataLoaders.
  • Enhanced Type Hinting: Defined comprehensive TypedDict structures (Targets, Crop, Metadata, Batch, etc.) to improve code clarity, maintainability, and static analysis for data flow.
Changelog
  • configs/data/samplers/weighted_random_sampler.yaml
    • Added a new YAML configuration for the AutoWeightedRandomSampler.
  • configs/experiment/modeling/training/base.yaml
    • Added a new base YAML configuration for modeling training, defining common data and metadata parameters.
  • configs/experiment/modeling/training/crop_level.yaml
    • Added a new crop-level training configuration, inheriting from the base and specifying data parameters like sampler_pos_thr and batch_size.
  • configs/experiment/modeling/training/nuclei_level.yaml
    • Added a new nuclei-level training configuration, inheriting from the base and specifying data parameters like sampler_pos_thr and batch_size.
  • configs/ml.yaml
    • Updated the default experiment training to nuclei_level.
    • Expanded the data section with detailed parameters for batch_size, num_workers, mlflow_uris, dataset, and supervision strategies.
  • nuclei_graph/data/data_module.py
    • Added the DataModule class, inheriting from LightningDataModule, to manage data loading and preprocessing.
    • Implemented methods for filtering dataframes, loading supervision data, and preparing datasets for training, validation, testing, and prediction stages.
    • Defined train_dataloader, val_dataloader, test_dataloader, and predict_dataloader methods for PyTorch DataLoaders.
  • nuclei_graph/data/samplers/init.py
    • Added AutoWeightedRandomSampler to the module's public interface.
  • nuclei_graph/data/samplers/weighted_random_sampler.py
    • Added the AutoWeightedRandomSampler class, which extends WeightedRandomSampler to assign weights based on class distribution and positivity thresholds.
  • nuclei_graph/data/utils/init.py
    • Added collate_fn, collate_fn_predict, and min_count_filter to the module's public interface.
  • nuclei_graph/data/utils/collator.py
    • Added collate_fn for batching Crop objects into a Batch structure.
    • Added collate_fn_predict for batching PredictSlide objects into a PredictBatch structure.
  • nuclei_graph/data/utils/filtering.py
    • Added the min_count_filter function to filter dataframes, removing slides that do not meet a minimum nuclei count.
  • nuclei_graph/nuclei_graph_typing.py
    • Added new TypedDict definitions for Targets, Crop, Metadata, Slide, PredictSlide, Batch, and PredictBatch to provide explicit type hints for data structures.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f13ba4a5-c9cd-46c8-84b5-2a7a77ac168e

📥 Commits

Reviewing files that changed from the base of the PR and between 4f52159 and 73435c8.

📒 Files selected for processing (5)
  • configs/experiment/modeling/training/base.yaml
  • configs/ml.yaml
  • nuclei_graph/data/data_module.py
  • nuclei_graph/data/utils/collator.py
  • nuclei_graph/nuclei_graph_typing.py
✅ Files skipped from review due to trivial changes (2)
  • configs/experiment/modeling/training/base.yaml
  • nuclei_graph/nuclei_graph_typing.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • configs/ml.yaml

📝 Walkthrough

Walkthrough

Adds a stage-aware Lightning DataModule, sampling utilities, batching/filtering helpers, typed data schemas, and YAML configurations to support slide-level supervision, weighted sampling by slide positivity, and dataloader/collation for training, testing, and prediction workflows.

Changes

Cohort / File(s) Summary
Configuration Files
configs/data/samplers/weighted_random_sampler.yaml, configs/experiment/modeling/training/base.yaml, configs/ml.yaml
New YAML configs: sampler config for AutoWeightedRandomSampler (partial, replacement, placeholder thresholds), base training config with top-level placeholders and hyperparam selectors, and expanded data section in ml.yaml (batch size, workers, metadata placeholders).
Data Module & Typing
nuclei_graph/data/data_module.py, nuclei_graph/nuclei_graph_typing.py
Added DataModule (stage-aware setup, MLflow artifact loading, split/stratify/group handling, supervision building, optional min-count filters, dataset instantiation, and dataloader creation). Added TypedDict types/aliases for Crop/Batch/Metadata/Predict structures.
Sampler Infrastructure
nuclei_graph/data/samplers/weighted_random_sampler.py, nuclei_graph/data/samplers/__init__.py
New AutoWeightedRandomSampler computing per-sample weights from slide-level positivity using positivity_thr, pos_slide_ratio, and replacement. Re-exported via package __all__.
Data Utilities
nuclei_graph/data/utils/collator.py, nuclei_graph/data/utils/filtering.py, nuclei_graph/data/utils/__init__.py
Added collate_fn and collate_fn_predict for batching slides/crops and stacking targets; added min_count_filter and min_positive_count_filter for slide-level filtering; exported utilities via module __all__.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DataModule
    participant MLflow
    participant Dataset as NucleiDataset
    participant Sampler as AutoWeightedRandomSampler
    participant Collator

    User->>DataModule: setup(stage="fit")
    DataModule->>MLflow: download metadata & supervision parquet
    MLflow-->>DataModule: metadata DataFrames
    DataModule->>DataModule: train_test_split (stratify/group if set)
    DataModule->>DataModule: apply min_count / min_positive filters
    DataModule->>DataModule: build supervision (train_strategy)
    DataModule->>Dataset: instantiate train/val datasets (full_slide per stage)

    User->>DataModule: train_dataloader()
    DataModule->>Sampler: init with dataset + slides_positivity + params
    Sampler->>Sampler: compute per-sample weights
    DataModule->>DataModule: create DataLoader (sampler/collate_fn)
    DataModule-->>User: return DataLoader

    loop Training loop
        User->>DataModule: request next batch
        DataModule->>Dataset: get samples
        DataModule->>Collator: collate_fn(batch)
        Collator-->>DataModule: batched tensors
        DataModule-->>User: yield batch
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • matejpekar
  • vejtek

Poem

🐰 I hopped through configs, datasets, and code so neat,
Slides split like gardens where supervision meets,
I nudged weights toward positives with a gentle cheer,
Batches stacked tidy for each training near,
A rabbit applauds — the pipeline hops ahead! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: datamodule' is directly related to the primary changes in this PR. The changeset introduces a DataModule class, configuration files for data handling, samplers, utilities, and type definitions—all core components of a data module infrastructure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/data/datamodule

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive LightningDataModule for handling nuclei graph data, including data loading, supervision strategies, and sampling. The implementation is well-structured with clear separation of concerns into different modules for data handling, sampling, and utilities. My review focuses on improving robustness, fixing a minor formatting bug, and enhancing clarity in type definitions. Overall, this is a solid feature addition.

Comment thread nuclei_graph/data/data_module.py
Comment thread nuclei_graph/data/samplers/weighted_random_sampler.py Outdated
Comment thread nuclei_graph/data/utils/filtering.py
Comment thread nuclei_graph/nuclei_graph_typing.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (5)
nuclei_graph/data/samplers/weighted_random_sampler.py (1)

37-48: Harden weight computation for missing slide IDs and degenerate class splits.

Line 40 assumes full key coverage in slides_positivity. Adding explicit validation and handling single-class cases makes failures clearer and behavior more predictable.

Proposed robustness patch
     ) -> Sequence[float]:
         slide_ids = dataset.slides["slide_id"].values
+        missing = [sid for sid in slide_ids if sid not in slides_positivity]
+        if missing:
+            raise ValueError(
+                f"Missing positivity scores for {len(missing)} slide(s), e.g. {missing[:5]}"
+            )
+
         labels = torch.tensor(
             [
                 1.0 if slides_positivity[slide_id] > positivity_thr else 0.0
                 for slide_id in slide_ids
             ]
         )
-        positive = labels.sum()
-        negative = len(labels) - positive
+        positive = int(labels.sum().item())
+        negative = len(labels) - positive
+        if positive == 0 or negative == 0:
+            return torch.ones(len(labels), dtype=torch.float32).tolist()
+
         weights = torch.zeros_like(labels)
         weights[labels == 0] = 1.0 / negative
         weights[labels == 1] = 1.0 / positive
         return weights.tolist()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nuclei_graph/data/samplers/weighted_random_sampler.py` around lines 37 - 48,
Validate that every slide_id in slide_ids exists in slides_positivity before
computing labels and handle missing keys explicitly (either by raising a clear
error referencing slides_positivity and slide_id or by assigning a default
positivity value). After computing labels (using slides_positivity and
positivity_thr), guard the subsequent class counts positive and negative against
zero to avoid division by zero: if one class is absent, assign sensible fallback
weights (e.g., uniform weights or all mass to the present class) instead of
dividing by zero. Ensure these checks are applied around the computations of
labels, positive, negative, and weights so weights[labels==0] and
weights[labels==1] are always safe to compute.
nuclei_graph/data/utils/filtering.py (1)

14-17: Use structured logging instead of print for dropped-slide reporting.

Lines 14-17 currently print directly; switching to a logger makes this controllable in training/inference environments.

Proposed refactor
 import pandas as pd
+import logging
+
+logger = logging.getLogger(__name__)
@@
-        print(
-            f"[INFO] Dropped slides with < {min_count} nuclei:\n",
-            dropped_slides.to_string(index=False),
-        )
+        logger.info(
+            "Dropped slides with < %s nuclei:\n%s",
+            min_count,
+            dropped_slides.to_string(index=False),
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nuclei_graph/data/utils/filtering.py` around lines 14 - 17, Replace the
direct print call that reports dropped slides with structured logging: add a
module-level logger (e.g., logger = logging.getLogger(__name__)) and change the
print(...) to logger.info(...) using a formatted message and parameters (include
min_count and dropped_slides.to_string(index=False) in the log). Ensure the
logging module is imported and that the log message is concise and structured
rather than using print, so dropped_slides and min_count are emitted via
logger.info instead of print.
nuclei_graph/data/data_module.py (1)

148-151: Prefer explicit runtime exceptions over assert for data/config validation.

Lines 148, 151, and 189 are runtime contract checks; explicit exceptions keep behavior consistent even when assertions are disabled.

Proposed refactor
-                assert self.train_strategy is not None
+                if self.train_strategy is None:
+                    raise ValueError("train_strategy must be set for stage='fit'/'validate'.")
@@
-                assert slides_df is not None
+                if slides_df is None:
+                    raise ValueError(f"Failed to load training metadata from URI: {slides_uri}")
@@
-                assert slides_df is not None
+                if slides_df is None:
+                    raise ValueError(f"Failed to load metadata for stage='{stage}' from URI: {slides_uri}")

Also applies to: 189-189

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nuclei_graph/data/data_module.py` around lines 148 - 151, Replace the runtime
asserts with explicit exceptions: instead of "assert self.train_strategy is not
None" raise a clear error (e.g. ValueError or RuntimeError) indicating that
train_strategy must be configured; instead of "assert slides_df is not None"
raise a descriptive exception (e.g. FileNotFoundError or ValueError) that
includes slides_uri and notes the failure of _load_df to return data; and apply
the same pattern to the other runtime contract assert in this module (the assert
referenced at line 189) so all validation failures use explicit exceptions with
helpful messages.
configs/ml.yaml (1)

4-4: Prefer a functional selector over inline profile comments.

Line 4’s # crop_level nuclei_level comment is easy to stale. Consider a dedicated selector variable (or documented override example) instead of a free-form inline note.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@configs/ml.yaml` at line 4, Replace the inline comment on the
/experiment/modeling/training entry with a dedicated selector variable (e.g.,
add a new key like training_selector or training_mode) and document allowed
values (crop_level, nuclei_level) so callers can override explicitly; update any
code reading /experiment/modeling/training to prefer the new selector
(training_selector or training_mode) and fall back to the existing value only if
the selector is absent so behavior remains backward-compatible.
configs/experiment/modeling/training/crop_level.yaml (1)

3-27: Consider extracting a shared training template.

This file is almost identical to nuclei_level.yaml except metadata naming; a shared base template would reduce drift and maintenance overhead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@configs/experiment/modeling/training/crop_level.yaml` around lines 3 - 27,
This config largely duplicates another config (e.g., identical defaults and
trainer/data/model blocks) — extract the common pieces into a shared base config
(e.g., experiment/modeling/training/shared_base.yaml) and have crop_level.yaml
and nuclei_level.yaml include that base via the defaults list; move repeated
keys (defaults, trainer, data.sampler_pos_thr, balance_sampling, batch_size,
split_size, model) into the shared base and keep only overrides in
crop_level.yaml (e.g., metadata.run_name and any true diffs), updating the
defaults entry instead of duplicating full blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@configs/data/samplers/weighted_random_sampler.yaml`:
- Line 3: The sampler config contains an unresolved required Hydra value
`positivity_thr: ???` which causes MissingMandatoryValue; replace the
placeholder by wiring it to the experiments' parameter (e.g., set
`positivity_thr: ${data.sampler_pos_thr}` in the sampler config) or
alternatively explicitly set `positivity_thr` in each experiment config
(`nuclei_level.yaml` and `crop_level.yaml`) so the sampler sees a concrete
float; target the `positivity_thr` key in the sampler config or the
`data.sampler_pos_thr` keys in the experiment configs when making the change.

In `@configs/experiment/modeling/training/base.yaml`:
- Around line 19-22: The base hyperparameters use null-safe oc.select for
sampler_positivity_threshold but not for batch_size, split_size, and
balance_sampling; update the interpolations for batch_size, split_size, and
balance_sampling to use the same null-safe pattern (oc.select) so all four
keys—sampler_positivity_threshold, batch_size, split_size, balance_sampling—are
resolved via ${oc.select:... , null} to avoid interpolation errors when composed
configs omit them.

In `@configs/experiment/modeling/training/crop_level.yaml`:
- Around line 6-12: The YAML uses explicit nulls that override defaults composed
earlier — when _self_ is used the entries like /data/datasets@data.dataset and
/model/meta_archs@model (and the nulls at data.dataset and model later) are
being set to null instead of deferring to the composed defaults; remove the
explicit nulls or replace them with Hydra delete syntax (~data.dataset, ~model)
if you intend to remove those keys, and ensure _self_ remains last so default
selections from /model/meta_archs@model, /model/models@model.net,
/data/datasets@data.dataset, /data/samplers@data.sampler, and
/data/supervision@data.train_strategy|data.eval_strategy are not overwritten by
null values.

In `@configs/experiment/modeling/training/nuclei_level.yaml`:
- Around line 6-12: The YAML composition places `_self_` last causing null
entries for `model` and `data.dataset` to override earlier composed defaults;
move or merge `_self_` so that entries from `/model/meta_archs@model`,
`/model/models@model.net`, and `/data/datasets@data.dataset` are not wiped.
Concretely, ensure `_self_` appears before default includes or remove the
explicit null keys under `model` and `data.dataset` in this file so that the
composed values from those selectors are preserved (check references
`/model/meta_archs@model`, `/model/models@model.net`,
`/data/datasets@data.dataset`, and the `_self_` anchor).

In `@nuclei_graph/data/data_module.py`:
- Around line 163-175: After calling min_count_filter on train_df, add a
fail-fast check immediately after that line to detect an empty train_df and
raise a clear exception (or ValueError) that explains that min_count_filter
removed all training slides and includes the dataset_cfg/crop_size context; this
should occur before calling _load_supervision_dfs, _prepare_supervision, or
instantiate so train_sup and train_dataset are not constructed from an empty
DataFrame.

In `@nuclei_graph/data/utils/collator.py`:
- Around line 18-21: The current code builds graph_targets by filtering out None
which breaks per-sample alignment; update the collator logic in collator.py
(variables graph_targets and batched_y construction) to first collect
graph_entries = [b["y"]["graph"] for b in batch] and then enforce all-or-none
graph supervision: if any(entry is None for entry in graph_entries) and not
all(entry is None for entry in graph_entries) raise a ValueError, if all are
None set batched_y["graph"]=None, otherwise set
batched_y["graph"]=torch.stack(graph_entries, dim=0) so ordering and batch
alignment are preserved for the graph target.

---

Nitpick comments:
In `@configs/experiment/modeling/training/crop_level.yaml`:
- Around line 3-27: This config largely duplicates another config (e.g.,
identical defaults and trainer/data/model blocks) — extract the common pieces
into a shared base config (e.g., experiment/modeling/training/shared_base.yaml)
and have crop_level.yaml and nuclei_level.yaml include that base via the
defaults list; move repeated keys (defaults, trainer, data.sampler_pos_thr,
balance_sampling, batch_size, split_size, model) into the shared base and keep
only overrides in crop_level.yaml (e.g., metadata.run_name and any true diffs),
updating the defaults entry instead of duplicating full blocks.

In `@configs/ml.yaml`:
- Line 4: Replace the inline comment on the /experiment/modeling/training entry
with a dedicated selector variable (e.g., add a new key like training_selector
or training_mode) and document allowed values (crop_level, nuclei_level) so
callers can override explicitly; update any code reading
/experiment/modeling/training to prefer the new selector (training_selector or
training_mode) and fall back to the existing value only if the selector is
absent so behavior remains backward-compatible.

In `@nuclei_graph/data/data_module.py`:
- Around line 148-151: Replace the runtime asserts with explicit exceptions:
instead of "assert self.train_strategy is not None" raise a clear error (e.g.
ValueError or RuntimeError) indicating that train_strategy must be configured;
instead of "assert slides_df is not None" raise a descriptive exception (e.g.
FileNotFoundError or ValueError) that includes slides_uri and notes the failure
of _load_df to return data; and apply the same pattern to the other runtime
contract assert in this module (the assert referenced at line 189) so all
validation failures use explicit exceptions with helpful messages.

In `@nuclei_graph/data/samplers/weighted_random_sampler.py`:
- Around line 37-48: Validate that every slide_id in slide_ids exists in
slides_positivity before computing labels and handle missing keys explicitly
(either by raising a clear error referencing slides_positivity and slide_id or
by assigning a default positivity value). After computing labels (using
slides_positivity and positivity_thr), guard the subsequent class counts
positive and negative against zero to avoid division by zero: if one class is
absent, assign sensible fallback weights (e.g., uniform weights or all mass to
the present class) instead of dividing by zero. Ensure these checks are applied
around the computations of labels, positive, negative, and weights so
weights[labels==0] and weights[labels==1] are always safe to compute.

In `@nuclei_graph/data/utils/filtering.py`:
- Around line 14-17: Replace the direct print call that reports dropped slides
with structured logging: add a module-level logger (e.g., logger =
logging.getLogger(__name__)) and change the print(...) to logger.info(...) using
a formatted message and parameters (include min_count and
dropped_slides.to_string(index=False) in the log). Ensure the logging module is
imported and that the log message is concise and structured rather than using
print, so dropped_slides and min_count are emitted via logger.info instead of
print.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0050cd6e-46c6-4045-be66-019f3c0c9170

📥 Commits

Reviewing files that changed from the base of the PR and between 571a718 and 58f3c82.

📒 Files selected for processing (12)
  • configs/data/samplers/weighted_random_sampler.yaml
  • configs/experiment/modeling/training/base.yaml
  • configs/experiment/modeling/training/crop_level.yaml
  • configs/experiment/modeling/training/nuclei_level.yaml
  • configs/ml.yaml
  • nuclei_graph/data/data_module.py
  • nuclei_graph/data/samplers/__init__.py
  • nuclei_graph/data/samplers/weighted_random_sampler.py
  • nuclei_graph/data/utils/__init__.py
  • nuclei_graph/data/utils/collator.py
  • nuclei_graph/data/utils/filtering.py
  • nuclei_graph/nuclei_graph_typing.py

Comment thread configs/data/samplers/weighted_random_sampler.yaml
Comment thread configs/experiment/modeling/training/base.yaml Outdated
Comment thread configs/experiment/modeling/training/crop_level.yaml Outdated
Comment thread configs/experiment/modeling/training/nuclei_level.yaml Outdated
Comment thread nuclei_graph/data/data_module.py Outdated
Comment thread nuclei_graph/data/utils/collator.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
nuclei_graph/data/data_module.py (1)

138-151: ⚠️ Potential issue | 🔴 Critical

Fail fast when filtering removes the training split.

Line 138 and Line 149 can both reduce train_df to zero rows. The code then continues into supervision/dataset construction, and drop_last=True later turns that into a zero-batch epoch instead of a clear setup error.

🚨 Proposed guard
                 train_df = min_count_filter(train_df, self.dataset_cfg.crop_size)
+                if train_df.empty:
+                    raise ValueError(
+                        "No training slides remain after min_count_filter; "
+                        f"crop_size={self.dataset_cfg.crop_size}."
+                    )
                 train_sup_dfs = self._load_sup_sources(self.train_strategy)
                 train_sup = self._prepare_supervision(
                     train_df, train_sup_dfs, self.train_strategy
                 )
                 self.positivity = train_sup.positivity_map
 
                 if self.dataset_cfg.mil:
                     min_pos_count = (
                         self.dataset_cfg.crop_size * self.dataset_cfg.crop_pos_thr
                     )
                     train_df = min_positive_count_filter(
                         train_df, min_pos_count, train_sup.pos_count_map
                     )
+                    if train_df.empty:
+                        raise ValueError(
+                            "No training slides remain after min_positive_count_filter; "
+                            f"min_pos_count={min_pos_count}."
+                        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nuclei_graph/data/data_module.py` around lines 138 - 151, After each
filtering step that can drop all training rows (after min_count_filter(train_df,
self.dataset_cfg.crop_size) and after min_positive_count_filter(...)),
immediately check if train_df is empty and raise a clear exception (e.g.,
ValueError) describing which filter removed the split and include relevant
context (dataset_cfg.crop_size, dataset_cfg.crop_pos_thr, and train_strategy);
do this before calling _load_sup_sources, _prepare_supervision or using
train_sup/positivity_map so the pipeline fails fast instead of producing
zero-batch epochs.
🧹 Nitpick comments (1)
nuclei_graph/data/samplers/weighted_random_sampler.py (1)

18-27: Document the actual weighting behavior.

This sampler is not doing plain inverse-frequency weighting, and replacement is not optional in the current signature. The docstring should describe the pos_slide_ratio contract explicitly so experiment configs are tuned against the real behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nuclei_graph/data/samplers/weighted_random_sampler.py` around lines 18 - 27,
Update the class/function docstring for WeightedRandomSampler (and its __init__
signature) to state the exact weighting behavior used (referencing the internal
method that computes weights, e.g., any compute_weights or weight assignment
logic) — explicitly describe the formula (for example: per-sample weight = 1 /
freq(class) or if you normalize across slides or use slide-level positivity to
scale sample weights, state that exact computation), clarify that replacement is
currently required/not optional in the implementation (match the actual flag
behavior in __init__), and spell out the pos_slide_ratio contract precisely
(what range it accepts, whether it is applied to slide-level selection vs
sample-level weighting, and whether it is a hard quota or target ratio). Make
the docstring changes in WeightedRandomSampler and its __init__ so experiment
configs can be tuned to the real behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nuclei_graph/data/samplers/weighted_random_sampler.py`:
- Around line 42-50: Validate pos_slide_ratio is within [0,1] and guard the
divisions by ensuring both class counts (variables positive and negative
computed from labels) are greater than zero before computing weights; if
positive==0 or negative==0 or pos_slide_ratio is out of range, raise a clear
ValueError describing the bad configuration (mentioning pos_slide_ratio,
positive, negative, slide_ids/slides_positivity) instead of proceeding to
compute weights[labels == 0] and weights[labels == 1]; update the logic around
labels, positive, negative, pos_slide_ratio and weights to perform these checks
early and fail fast with a descriptive message.

---

Duplicate comments:
In `@nuclei_graph/data/data_module.py`:
- Around line 138-151: After each filtering step that can drop all training rows
(after min_count_filter(train_df, self.dataset_cfg.crop_size) and after
min_positive_count_filter(...)), immediately check if train_df is empty and
raise a clear exception (e.g., ValueError) describing which filter removed the
split and include relevant context (dataset_cfg.crop_size,
dataset_cfg.crop_pos_thr, and train_strategy); do this before calling
_load_sup_sources, _prepare_supervision or using train_sup/positivity_map so the
pipeline fails fast instead of producing zero-batch epochs.

---

Nitpick comments:
In `@nuclei_graph/data/samplers/weighted_random_sampler.py`:
- Around line 18-27: Update the class/function docstring for
WeightedRandomSampler (and its __init__ signature) to state the exact weighting
behavior used (referencing the internal method that computes weights, e.g., any
compute_weights or weight assignment logic) — explicitly describe the formula
(for example: per-sample weight = 1 / freq(class) or if you normalize across
slides or use slide-level positivity to scale sample weights, state that exact
computation), clarify that replacement is currently required/not optional in the
implementation (match the actual flag behavior in __init__), and spell out the
pos_slide_ratio contract precisely (what range it accepts, whether it is applied
to slide-level selection vs sample-level weighting, and whether it is a hard
quota or target ratio). Make the docstring changes in WeightedRandomSampler and
its __init__ so experiment configs can be tuned to the real behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cf30dd6d-015e-4f26-ae76-88d2ddb2cd1a

📥 Commits

Reviewing files that changed from the base of the PR and between 58f3c82 and 0d4b884.

📒 Files selected for processing (10)
  • configs/data/samplers/weighted_random_sampler.yaml
  • configs/experiment/modeling/training/base.yaml
  • configs/experiment/modeling/training/crop_level.yaml
  • configs/experiment/modeling/training/nuclei_level.yaml
  • configs/ml.yaml
  • nuclei_graph/data/data_module.py
  • nuclei_graph/data/samplers/weighted_random_sampler.py
  • nuclei_graph/data/utils/__init__.py
  • nuclei_graph/data/utils/collator.py
  • nuclei_graph/data/utils/filtering.py
✅ Files skipped from review due to trivial changes (3)
  • configs/data/samplers/weighted_random_sampler.yaml
  • configs/experiment/modeling/training/nuclei_level.yaml
  • configs/experiment/modeling/training/crop_level.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • configs/ml.yaml
  • nuclei_graph/data/utils/init.py
  • configs/experiment/modeling/training/base.yaml
  • nuclei_graph/data/utils/collator.py
  • nuclei_graph/data/utils/filtering.py

Comment thread nuclei_graph/data/samplers/weighted_random_sampler.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
visualization/polygons2raster.py (1)

148-151: Consider specifying keep parameter in drop_duplicates for clarity.

The drop_duplicates(subset=["slide_path"]) call defaults to keep='first', which may silently discard rows if the same slide appears in multiple metadata URIs with different slide_nuclei_path values. If this is the intended behavior, adding an explicit keep='first' would improve readability. Alternatively, if duplicates are unexpected, consider raising an error or logging a warning.

💡 Optional: Make deduplication behavior explicit
 def uris2df(uris: list[str]) -> pd.DataFrame:
     """Loads and merges multiple metadata Parquet files into a single DataFrame."""
     batches = [pd.read_parquet(download_artifacts(uri)) for uri in uris]
-    return pd.concat(batches, ignore_index=True).drop_duplicates(subset=["slide_path"])
+    return pd.concat(batches, ignore_index=True).drop_duplicates(subset=["slide_path"], keep="first")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@visualization/polygons2raster.py` around lines 148 - 151, The uris2df
function currently calls drop_duplicates(subset=["slide_path"]) which relies on
the default keep='first' implicitly; update uris2df to make deduplication
behavior explicit by specifying keep='first' if keeping the first occurrence is
intended, or change to keep='last' as appropriate, or instead detect duplicate
slide_path values after concatenation and either raise an exception or log a
warning (using the project logger) so mismatched slide_nuclei_path values are
surfaced; refer to the uris2df function and its drop_duplicates call to
implement the chosen explicit behavior.
preprocessing/data_split.py (1)

31-33: Consider using .loc or .copy() to avoid potential SettingWithCopyWarning.

Assigning to train["set"] and test["set"] directly may trigger a SettingWithCopyWarning depending on the pandas version and how train_test_split returns the DataFrames. Using .loc or explicitly calling .copy() on the split results would make the intent clearer.

💡 Optional: Explicit copy to avoid warnings
     train, test = train_test_split(
         slides,
         test_size=config.test_size,
         random_state=42,
         stratify=slides[config.stratify_column],
     )
-    train["set"] = "train"
-    test["set"] = "test"
+    train = train.copy()
+    test = test.copy()
+    train["set"] = "train"
+    test["set"] = "test"
     split = pd.concat([train, test], ignore_index=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/data_split.py` around lines 31 - 33, The direct assignments
train["set"] = "train" and test["set"] = "test" can trigger pandas
SettingWithCopyWarning; to fix, make the split DataFrames explicit copies (e.g.,
call copy() on the results returned into train and test) or assign via .loc
(e.g., train.loc[:, "set"] = "train" and test.loc[:, "set"] = "test") before
concatenating into split so the operations modify bona fide DataFrames rather
than potential views.
exploration/panda/save_metadataset.py (1)

34-37: Serialize error logging instead of appending from every worker.

All Ray tasks share the same errors.log path and write to it concurrently. That can interleave or lose log lines under parallel execution, especially on network-backed storage. Prefer collecting error strings in the driver or routing writes through a dedicated actor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@exploration/panda/save_metadataset.py` around lines 34 - 37, The current log
function (def log(msg: str)) appends directly to the shared errors.log from
every Ray worker causing interleaved/lost lines; change workers to stop writing
files directly and either (a) return error strings from the Ray tasks to the
driver and perform single-threaded writes to errors.log in the driver, or (b)
create a dedicated Ray actor (e.g., class ErrorLogger with a log(self, msg: str)
method) that serializes writes and call error_logger.log.remote(msg) from
workers; update all usages of log(...) in the code to instead return errors or
call the actor so only one process performs file I/O.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@exploration/panda/save_metadataset.py`:
- Around line 51-60: The current use of tifffile.imread(materializes the whole
raster) should be replaced by a lightweight validation: open the file with
tifffile.TiffFile in a context manager (instead of tifffile.imread), check that
the file has at least one page/series and that the first page/series reports a
non-zero shape/size (e.g., pages or series metadata) without calling .asarray()
or reading pixel data, and preserve the same exception handling and log messages
(referencing mask_path, tifffile.TiffFile, and log). If the file has no pages or
reports zero pixels treat it as MASK_EMPTY and return False; on any exception
log MASK_CORRUPTED with the exception string and return False.

In `@preprocessing/metadata_mapping/panda.py`:
- Around line 35-45: The lambda in nuclei_paths is using parameter name "id"
which shadows Python's built-in id(); change the parameter to a descriptive name
(e.g., "sid") in the nuclei_paths map call (the expression nuclei_paths =
slides["slide_id"].map(lambda id: nuclei_dir / f"slide_id={id}")) and update any
corresponding references in that lambda to use the new name so the behavior of
nuclei_paths and subsequent map_df entries ("slide_id", "slide_nuclei_path")
remains unchanged.

---

Nitpick comments:
In `@exploration/panda/save_metadataset.py`:
- Around line 34-37: The current log function (def log(msg: str)) appends
directly to the shared errors.log from every Ray worker causing interleaved/lost
lines; change workers to stop writing files directly and either (a) return error
strings from the Ray tasks to the driver and perform single-threaded writes to
errors.log in the driver, or (b) create a dedicated Ray actor (e.g., class
ErrorLogger with a log(self, msg: str) method) that serializes writes and call
error_logger.log.remote(msg) from workers; update all usages of log(...) in the
code to instead return errors or call the actor so only one process performs
file I/O.

In `@preprocessing/data_split.py`:
- Around line 31-33: The direct assignments train["set"] = "train" and
test["set"] = "test" can trigger pandas SettingWithCopyWarning; to fix, make the
split DataFrames explicit copies (e.g., call copy() on the results returned into
train and test) or assign via .loc (e.g., train.loc[:, "set"] = "train" and
test.loc[:, "set"] = "test") before concatenating into split so the operations
modify bona fide DataFrames rather than potential views.

In `@visualization/polygons2raster.py`:
- Around line 148-151: The uris2df function currently calls
drop_duplicates(subset=["slide_path"]) which relies on the default keep='first'
implicitly; update uris2df to make deduplication behavior explicit by specifying
keep='first' if keeping the first occurrence is intended, or change to
keep='last' as appropriate, or instead detect duplicate slide_path values after
concatenation and either raise an exception or log a warning (using the project
logger) so mismatched slide_nuclei_path values are surfaced; refer to the
uris2df function and its drop_duplicates call to implement the chosen explicit
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b0e9e280-5666-41f6-9ff5-39b591a1f7ef

📥 Commits

Reviewing files that changed from the base of the PR and between d48b8ed and cbfbb15.

📒 Files selected for processing (45)
  • configs/base.yaml
  • configs/data/sources/panda.yaml
  • configs/data/sources/prostate_cancer.yaml
  • configs/data/sources/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/annotation_labels/panda.yaml
  • configs/experiment/preprocessing/cam_labels/annot_restricted_thr/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/cam_labels/default_thr/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/cam_masks/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/data_split/radboud.yaml
  • configs/experiment/preprocessing/metadata_mapping/radboud.yaml
  • configs/experiment/preprocessing/unipolar_heatmap_labels/annotation/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/unipolar_heatmap_labels/prediction/virchow2/prostate_cancer_mmci_tl.yaml
  • configs/experiment/visualization/annotation_labels.yaml
  • configs/exploration/panda/save_metadataset.yaml
  • configs/exploration/prostate_cancer_mmci_tl/save_metadataset.yaml
  • configs/preprocessing/annotation_labels.yaml
  • configs/preprocessing/data_split.yaml
  • configs/preprocessing/metadata_mapping/panda.yaml
  • configs/preprocessing/metadata_mapping/prostate_cancer_mmci_tl.yaml
  • configs/preprocessing/nuclei_standardization.yaml
  • configs/visualization/polygons2raster/prostate_cancer_mmci_tl.yaml
  • configs/visualization/polygons2raster/radboud.yaml
  • exploration/__init__.py
  • exploration/panda/__init__.py
  • exploration/panda/save_metadataset.py
  • exploration/prostate_cancer_mmci_tl/__init__.py
  • exploration/prostate_cancer_mmci_tl/save_metadataset.py
  • preprocessing/__init__.py
  • preprocessing/annotation_labels.py
  • preprocessing/data_split.py
  • preprocessing/metadata_mapping/__init__.py
  • preprocessing/metadata_mapping/panda.py
  • preprocessing/metadata_mapping/prostate_cancer_mmci_tl.py
  • preprocessing/nuclei_standardization.py
  • scripts/exploration/panda/__init__.py
  • scripts/exploration/panda/run_save_metadataset.py
  • scripts/exploration/prostate_cancer_mmci_tl/__init__.py
  • scripts/exploration/prostate_cancer_mmci_tl/run_save_metadataset.py
  • scripts/preprocessing/metadata_mapping/run_panda.py
  • scripts/preprocessing/metadata_mapping/run_prostate_cancer_mmci_tl.py
  • scripts/preprocessing/run_annotation_labels.py
  • scripts/preprocessing/run_data_split.py
  • scripts/preprocessing/run_nuclei_standardization.py
  • scripts/visualization/run_polygons_rasterization.py
  • visualization/polygons2raster.py
💤 Files with no reviewable changes (2)
  • configs/experiment/visualization/annotation_labels.yaml
  • configs/data/sources/prostate_cancer.yaml
✅ Files skipped from review due to trivial changes (16)
  • configs/experiment/preprocessing/unipolar_heatmap_labels/prediction/virchow2/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/cam_masks/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/cam_labels/annot_restricted_thr/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/cam_labels/default_thr/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/unipolar_heatmap_labels/annotation/prostate_cancer_mmci_tl.yaml
  • configs/experiment/preprocessing/metadata_mapping/radboud.yaml
  • configs/data/sources/panda.yaml
  • configs/preprocessing/nuclei_standardization.yaml
  • configs/experiment/preprocessing/data_split/radboud.yaml
  • configs/base.yaml
  • configs/experiment/preprocessing/annotation_labels/panda.yaml
  • configs/exploration/panda/save_metadataset.yaml
  • configs/preprocessing/data_split.yaml
  • configs/visualization/polygons2raster/radboud.yaml
  • configs/preprocessing/metadata_mapping/panda.yaml
  • configs/data/sources/prostate_cancer_mmci_tl.yaml

Comment thread exploration/panda/save_metadataset.py
Comment thread preprocessing/metadata_mapping/panda.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
nuclei_graph/data/data_module.py (2)

133-143: Consider making random_state configurable.

The train/val split seed is hardcoded to 42. For reproducibility control and for sweeps/ablations where a different seed is needed, exposing this as a constructor/config argument (e.g., split_seed: int = 42) is generally preferable and aligns with the other split-related parameters already plumbed through (split_stratify_col, split_group_col, split_size).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nuclei_graph/data/data_module.py` around lines 133 - 143, Add a configurable
seed parameter (e.g., split_seed: int = 42) to the DataModule constructor and
store it as self.split_seed, then replace the hardcoded random_state=42 in the
train_test_split call with random_state=self.split_seed; update any
docstring/parameter list for the class to document split_seed and ensure
train_test_split (the call using slides_df, self.split_size,
self.split_stratify_col, and self.split_group_col) uses that stored value for
reproducible and configurable splits.

227-245: Inconsistent pin_memory across eval dataloaders.

val_dataloader sets pin_memory=True but test_dataloader and predict_dataloader do not. If the omission is deliberate, ignore; otherwise align them for consistent host→device transfer behavior.

Proposed alignment
     def test_dataloader(self) -> Iterable[Batch]:
         return DataLoader(
             self.test_dataset,
             batch_size=1,
             num_workers=self.eval_num_workers,
             persistent_workers=self.eval_num_workers > 0,
             prefetch_factor=2 if self.eval_num_workers > 0 else None,
+            pin_memory=True,
             collate_fn=collate_fn,
         )

     def predict_dataloader(self) -> Iterable[PredictBatch]:
         return DataLoader(
             self.predict_dataset,
             batch_size=1,
             num_workers=self.eval_num_workers,
             persistent_workers=self.eval_num_workers > 0,
             prefetch_factor=2 if self.eval_num_workers > 0 else None,
+            pin_memory=True,
             collate_fn=collate_fn_predict,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nuclei_graph/data/data_module.py` around lines 227 - 245, The test_dataloader
and predict_dataloader are missing the pin_memory setting present in
val_dataloader; update both DataLoader constructions in test_dataloader and
predict_dataloader to include pin_memory=True (or use the same eval pin_memory
attribute if one exists) so host→device transfer behavior is consistent with
val_dataloader.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@nuclei_graph/data/data_module.py`:
- Around line 133-143: Add a configurable seed parameter (e.g., split_seed: int
= 42) to the DataModule constructor and store it as self.split_seed, then
replace the hardcoded random_state=42 in the train_test_split call with
random_state=self.split_seed; update any docstring/parameter list for the class
to document split_seed and ensure train_test_split (the call using slides_df,
self.split_size, self.split_stratify_col, and self.split_group_col) uses that
stored value for reproducible and configurable splits.
- Around line 227-245: The test_dataloader and predict_dataloader are missing
the pin_memory setting present in val_dataloader; update both DataLoader
constructions in test_dataloader and predict_dataloader to include
pin_memory=True (or use the same eval pin_memory attribute if one exists) so
host→device transfer behavior is consistent with val_dataloader.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe33de16-622f-4163-b791-35f74a9528ea

📥 Commits

Reviewing files that changed from the base of the PR and between d48b8ed and ecdb26e.

📒 Files selected for processing (4)
  • configs/experiment/modeling/training/base.yaml
  • configs/ml.yaml
  • nuclei_graph/data/data_module.py
  • nuclei_graph/data/samplers/weighted_random_sampler.py
✅ Files skipped from review due to trivial changes (1)
  • configs/experiment/modeling/training/base.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • configs/ml.yaml

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
nuclei_graph/data/data_module.py (1)

127-130: Use explicit runtime validation instead of assert.

These checks protect required configuration and loaded data. While assertions function during normal execution, they provide no actionable error messages and become problematic when Python is run with optimization flags (-O), which strip assertions entirely. Replacing them with explicit exceptions makes failures more debuggable.

Suggested fix
             case "fit" | "validate":
-                assert self.split_size is not None
+                if self.split_size is None:
+                    raise ValueError("`split_size` must be configured for fit/validate setup.")
                 slides_df = self._load_df(slides_uri)
-                assert slides_df is not None
+                if slides_df is None:
+                    raise ValueError("Training metadata URI must be configured for fit/validate setup.")
 
                 train_df, validation_df = train_test_split(
@@
                 if stage == "fit":
-                    assert self.train_strategy is not None
+                    if self.train_strategy is None:
+                        raise ValueError("`supervision.train_strategy` must be configured for fit setup.")

Also applies to: 146-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nuclei_graph/data/data_module.py` around lines 127 - 130, Replace the runtime
assertions with explicit validation and informative exceptions: instead of using
assert self.split_size is not None and assert slides_df is not None in the case
"fit" | "validate" block (and the similar asserts at the later occurrence around
lines 146-148), check the conditions and raise ValueError or RuntimeError with
clear messages (e.g., "split_size must be set for fit/validate" and "failed to
load slides dataframe from {slides_uri}") so failures are never stripped by -O
and provide actionable diagnostics; locate these checks around the method that
calls self._load_df and update them to validate self.split_size and the returned
slides_df before proceeding.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nuclei_graph/data/data_module.py`:
- Around line 178-183: The code currently treats "test" and "predict" the same
by calling _load_df with METADATA_COLS_EVAL and then _prepare_supervision;
change this so "test" retains the supervised flow (keep _load_df(...,
cols=METADATA_COLS_EVAL) and call _prepare_supervision(slides_df,
self.eval_strategy.paths, self.eval_strategy)), while "predict" loads a
prediction-only metadata column set (e.g. METADATA_COLS_PRED or a subset that
omits is_carcinoma) via _load_df(slides_uri, cols=...) and does NOT call
_prepare_supervision; instead create the PredictBatch/Metadata using that
label-free slides_df (so PredictBatch uses Metadata without is_carcinoma).
Ensure slides_df is still checked for None in both branches and reference the
existing methods/objects: _load_df, METADATA_COLS_EVAL, _prepare_supervision,
PredictBatch, Metadata, and self.eval_strategy.

---

Nitpick comments:
In `@nuclei_graph/data/data_module.py`:
- Around line 127-130: Replace the runtime assertions with explicit validation
and informative exceptions: instead of using assert self.split_size is not None
and assert slides_df is not None in the case "fit" | "validate" block (and the
similar asserts at the later occurrence around lines 146-148), check the
conditions and raise ValueError or RuntimeError with clear messages (e.g.,
"split_size must be set for fit/validate" and "failed to load slides dataframe
from {slides_uri}") so failures are never stripped by -O and provide actionable
diagnostics; locate these checks around the method that calls self._load_df and
update them to validate self.split_size and the returned slides_df before
proceeding.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 766271fc-9cfb-4066-89fc-14a992b6bd79

📥 Commits

Reviewing files that changed from the base of the PR and between ecdb26e and 4f52159.

📒 Files selected for processing (1)
  • nuclei_graph/data/data_module.py

Comment thread nuclei_graph/data/data_module.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant