feat: supervision#6
Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 56 minutes and 37 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughAdds a new slide- and dataset-level supervision subsystem (configs + runtime factory) and new Hydra configurations for multiple supervision modes and training experiments; implements NucleiSupervision classes, a SupervisionStrategy factory, and build_supervision to produce DatasetSupervision mappings. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Builder as build_supervision()
participant Strategy as SupervisionStrategy
participant Nuclei as NucleiSupervision
participant Dataset as DatasetSupervision
User->>Builder: provide sup_dfs, carcinoma_map, strategy
Builder->>Builder: merge/join supervision DataFrames by slide_id
loop per slide
Builder->>Strategy: create(is_carcinoma, available label tensors)
Strategy->>Nuclei: instantiate appropriate NucleiSupervision subclass
Nuclei-->>Strategy: return supervision instance
Strategy-->>Builder: supervision instance
Builder->>Builder: wrap into SlideSupervision
end
Builder->>Dataset: assemble SlideSupervision map
Dataset-->>Builder: DatasetSupervision
Builder-->>User: return DatasetSupervision
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/gemini summary |
Summary of ChangesThis pull request introduces a flexible and extensible supervision system for nucleus-level data within the Highlights
Changelog
Activity
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (3)
configs/ml.yaml (1)
13-22: Consider clarifying the field override hierarchy.The
datasection defines several fields that are also present in the experiment configs (e.g.,batch_size,dataset,supervision.eval_strategy). Given the defaults composition order (line 4 placesnuclei_levelafter_self_), values fromnuclei_level.yamlwill override these placeholders. While this is valid Hydra behavior, the duplication might confuse users.Consider either:
- Adding a comment explaining the override hierarchy
- Consolidating required fields to avoid duplication
- Using more specific placeholder names to clarify which configs should provide values
For example,
nuclei_level.yamlsetsbatch_size: 32, which will overridebatch_size: ???defined here.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@configs/ml.yaml` around lines 13 - 22, The data section currently repeats fields (data.batch_size, data.dataset, data.supervision.eval_strategy) that are also provided by included level configs (e.g., nuclei_level.yaml) and this can be confusing given the defaults composition order that places nuclei_level after _self_. Update configs to clarify the override hierarchy by either adding a concise comment in configs/ml.yaml explaining that nuclei_level.yaml will override these placeholders, consolidating/removing duplicated placeholders here so only one source supplies values, or renaming the placeholders to explicit names (e.g., data.batch_size_default or data.batch_size_override) so it's clear which file should supply the final value; reference data.batch_size, data.dataset, data.supervision.eval_strategy, nuclei_level.yaml and the _self_ inclusion when making the change.configs/experiment/modeling/training/nuclei_level.yaml (1)
12-20: Ensure TODO placeholders are addressed before production use.The config contains multiple TODO sections (
trainer,dataset,sampler,model) that must be populated for the experiment to run. This same pattern appears across multiple training experiment configs (crop_level.yaml,base.yaml), indicating systemic scaffolding. No documentation was found to guide completion of these placeholders.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@configs/experiment/modeling/training/nuclei_level.yaml` around lines 12 - 20, The YAML contains unfilled TODO placeholders (trainer, dataset, sampler, model) that will prevent runs; replace each TODO with concrete config entries: for trainer fill optimizer, learning_rate, epochs, device, checkpointing and seed (e.g., optimizer name, lr schedule, max_epochs), for dataset replace dataset with dataset_name/path, split strategy and transforms/preprocessing, for sampler replace sampler with sampler_type and any params (e.g., weighted, random, batch_sampler settings), and for model specify architecture, pretrained flag, num_classes and any head/hyperparameters; apply the same fixes to crop_level.yaml and base.yaml, validate each file against the training schema or config loader, and add/update short docs/comments describing required keys and example values so future PRs don't reintroduce TODOs.nuclei_graph/data/supervision.py (1)
197-209: Inner join may silently drop slides with partial supervision data.The
reducewith inner merge means only slides present in all provided DataFrames will be included. If a slide has annotation labels but missing CAM labels, it will be silently excluded. Consider whether anouterjoin with explicit handling of missing columns would be more appropriate, or at minimum log which slides were dropped.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nuclei_graph/data/supervision.py` around lines 197 - 209, The current reduce+pd.merge in sup_groups uses how="inner", which silently drops slides not present in every DataFrame (sources from sup_dfs); change the merge strategy to how="outer" (or perform pairwise outer merges) to preserve slides with partial supervision, then explicitly handle missing columns/rows (e.g., fillna, add boolean flags, or filter later) and log which slide_ids were dropped or have missing supervision types; update references to sources/sup_dfs/sup_groups and the reduce(pd.merge(..., how="inner")) call accordingly so downstream code can detect and handle NaNs rather than losing rows silently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@configs/experiment/modeling/training/base.yaml`:
- Around line 14-15: The supervision block only declares train_strategy
(supervision.train_strategy: ???) but other code expects
supervision.eval_strategy; add a symmetric placeholder by declaring
supervision.eval_strategy: ??? so missing eval_strategy fails fast and clearly
during config composition—update the YAML block that defines supervision (the
train_strategy entry and add a new eval_strategy entry) to include the eval
placeholder.
In `@nuclei_graph/data/supervision.py`:
- Around line 174-176: Annotate the __init__ to satisfy mypy and clarify uris
usage: change def __init__(self, mode: str, **uris: str) -> None and add a typed
attribute assignment like self.uris: dict[str, str] = uris; if the stored uris
are intentionally unused, either remove the assignment entirely or rename to
self._uris: dict[str, str] = uris (or add a # noqa / # unused comment) to signal
future use. Ensure you import any typing primitives if needed for older Python
versions.
- Around line 180-188: Update the create signature to annotate the kwargs so
mypy knows the types of label lists: change def create(self, is_carcinoma: bool,
**all_labels) -> NucleiSupervision: to def create(self, is_carcinoma: bool,
**all_labels: Sequence[Any]) -> NucleiSupervision: and add the necessary typing
imports (Sequence, Any) at the top; keep the rest of the body the same
(filtered_labels usage will then be typed as a dict of sequences).
- Around line 220-224: The code calls sup_groups.get_group(slide_id) which
raises KeyError for carcinoma slides absent from the merged supervision groups;
to fix, guard that call by checking membership (e.g., if slide_id in
sup_groups.groups:) or wrap get_group in try/except KeyError, and on missing
group skip populating COL_TO_KWARG entries (or set explicit default tensors) and
optionally log a warning; update the block around is_carcinoma to use this check
and only iterate COL_TO_KWARG when the group exists to avoid the crash.
---
Nitpick comments:
In `@configs/experiment/modeling/training/nuclei_level.yaml`:
- Around line 12-20: The YAML contains unfilled TODO placeholders (trainer,
dataset, sampler, model) that will prevent runs; replace each TODO with concrete
config entries: for trainer fill optimizer, learning_rate, epochs, device,
checkpointing and seed (e.g., optimizer name, lr schedule, max_epochs), for
dataset replace dataset with dataset_name/path, split strategy and
transforms/preprocessing, for sampler replace sampler with sampler_type and any
params (e.g., weighted, random, batch_sampler settings), and for model specify
architecture, pretrained flag, num_classes and any head/hyperparameters; apply
the same fixes to crop_level.yaml and base.yaml, validate each file against the
training schema or config loader, and add/update short docs/comments describing
required keys and example values so future PRs don't reintroduce TODOs.
In `@configs/ml.yaml`:
- Around line 13-22: The data section currently repeats fields (data.batch_size,
data.dataset, data.supervision.eval_strategy) that are also provided by included
level configs (e.g., nuclei_level.yaml) and this can be confusing given the
defaults composition order that places nuclei_level after _self_. Update configs
to clarify the override hierarchy by either adding a concise comment in
configs/ml.yaml explaining that nuclei_level.yaml will override these
placeholders, consolidating/removing duplicated placeholders here so only one
source supplies values, or renaming the placeholders to explicit names (e.g.,
data.batch_size_default or data.batch_size_override) so it's clear which file
should supply the final value; reference data.batch_size, data.dataset,
data.supervision.eval_strategy, nuclei_level.yaml and the _self_ inclusion when
making the change.
In `@nuclei_graph/data/supervision.py`:
- Around line 197-209: The current reduce+pd.merge in sup_groups uses
how="inner", which silently drops slides not present in every DataFrame (sources
from sup_dfs); change the merge strategy to how="outer" (or perform pairwise
outer merges) to preserve slides with partial supervision, then explicitly
handle missing columns/rows (e.g., fillna, add boolean flags, or filter later)
and log which slide_ids were dropped or have missing supervision types; update
references to sources/sup_dfs/sup_groups and the reduce(pd.merge(...,
how="inner")) call accordingly so downstream code can detect and handle NaNs
rather than losing rows silently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: db751d6d-0769-4759-9de0-ca7d218d8bbf
📒 Files selected for processing (11)
configs/data/supervision/agreement.yamlconfigs/data/supervision/annotation.yamlconfigs/data/supervision/base.yamlconfigs/data/supervision/cam.yamlconfigs/data/supervision/prediction.yamlconfigs/experiment/modeling/training/base.yamlconfigs/experiment/modeling/training/crop_level.yamlconfigs/experiment/modeling/training/nuclei_level.yamlconfigs/ml.yamlnuclei_graph/data/__init__.pynuclei_graph/data/supervision.py
There was a problem hiding this comment.
🧹 Nitpick comments (2)
nuclei_graph/data/supervision.py (2)
197-198: Consider using explicit exception for empty sources validation.Same as above—
assertcan be disabled with-O. An emptysourceslist indicates a configuration error that should fail reliably.♻️ Proposed fix
sources = [df for df in sup_dfs.values() if df is not None] - assert sources + if not sources: + raise ValueError("At least one supervision DataFrame must be provided")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nuclei_graph/data/supervision.py` around lines 197 - 198, Replace the runtime-unsafe assert with an explicit check that raises a clear exception: after building sources = [df for df in sup_dfs.values() if df is not None], add an if not sources: raise ValueError(...) (or a custom ConfigurationError) with a descriptive message mentioning sup_dfs/sources so the configuration error fails reliably even when Python is run with optimizations disabled.
184-186: Consider using explicit exception instead ofassertfor runtime validation.
assertstatements can be disabled when Python runs with optimization flags (-Oor-OO). For carcinoma slides, missing required labels is a data integrity issue that should fail reliably.♻️ Proposed fix
if is_carcinoma: for k, v in filtered_labels.items(): - assert len(v) > 0, f"Missing required label {k} for slide" + if len(v) == 0: + raise ValueError(f"Missing required label '{k}' for carcinoma slide")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nuclei_graph/data/supervision.py` around lines 184 - 186, Replace the runtime use of assert in the carcinoma-label check with an explicit exception so failures can't be skipped under Python optimizations: in the block that checks is_carcinoma and iterates filtered_labels (the loop using for k, v in filtered_labels.items()), change the assertion to raise a clear exception (e.g., ValueError or a custom DataValidationError) that includes the missing label name and context about the slide so the validation always fails reliably at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@nuclei_graph/data/supervision.py`:
- Around line 197-198: Replace the runtime-unsafe assert with an explicit check
that raises a clear exception: after building sources = [df for df in
sup_dfs.values() if df is not None], add an if not sources: raise
ValueError(...) (or a custom ConfigurationError) with a descriptive message
mentioning sup_dfs/sources so the configuration error fails reliably even when
Python is run with optimizations disabled.
- Around line 184-186: Replace the runtime use of assert in the carcinoma-label
check with an explicit exception so failures can't be skipped under Python
optimizations: in the block that checks is_carcinoma and iterates
filtered_labels (the loop using for k, v in filtered_labels.items()), change the
assertion to raise a clear exception (e.g., ValueError or a custom
DataValidationError) that includes the missing label name and context about the
slide so the validation always fails reliably at runtime.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5182ffe5-2ba7-44e3-a529-d3b3e392474c
📒 Files selected for processing (1)
nuclei_graph/data/supervision.py
Summary by CodeRabbit
New Features
Configuration