Feature/update docs#45
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request primarily focuses on enhancing the documentation for constructing and efficiently loading tiled datasets, particularly in large-scale computer vision applications like digital pathology. It refactors the explanation of data loading pipelines, clarifies the roles of different dataset components, and introduces a performance optimization for filtering tile metadata, ensuring that the system can handle massive datasets without memory issues. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReorganized the tiled-dataset quick-start and refactored the SlideDataset example to persist HF slide/tile datasets, precompute a tile index using pyarrow, add a slice-based Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request significantly improves the documentation for loading tiled datasets by introducing a much more efficient data loading strategy. The new approach, which involves pre-sorting the tile metadata and building an index, is a great enhancement. However, I found a critical bug in the implementation of the index-building logic (_build_tile_index) in the example code. The use of pc.value_counts is incorrect as it doesn't guarantee order, which can lead to silent data corruption. I've provided a corrected implementation using pc.run_end_encode. The other changes are minor docstring cleanups which look good.
There was a problem hiding this comment.
Pull request overview
Updates documentation/examples around tiled dataset loading and cleans up Ray tensor aggregator docstrings.
Changes:
- Refactors
tiled-dataset.mdto better explainTileDataset/SlideDatasetand introduces an index-based tile lookup example. - Adds a
pyarrow.compute-based_build_tile_indexexample to avoid per-slidefilter(...)lambdas. - Removes stray blank
>>>lines fromTensorMean/TensorStddocstring examples.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
ratiopath/ray/aggregate/tensor_std.py |
Docstring doctest formatting cleanup (removes empty prompts). |
ratiopath/ray/aggregate/tensor_mean.py |
Docstring doctest formatting cleanup (removes empty prompts). |
docs/learn/get-started/quick-start/tiled-dataset.md |
Reworks quick-start narrative and updates example code to use a precomputed tile index for per-slide selection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Line 80: The docs still describe per-slide streaming/filtering but the
implementation now uses SlideDataset with a precomputed index, sort, and
index-based selection; update the wording (the sentences that mention Smart
Caching / per-slide filtering and dynamic streaming) to explain that
SlideDataset precomputes an index, sorts tiles, and then you create a view with
select(range) (or select by index range) which is cached to disk, rather than
streaming and filtering per-slide at read time; mention SlideDataset, sort,
precomputed index, and select(range) explicitly so readers know how the
cache/view is produced.
- Around line 146-157: The current logic uses pc.value_counts(slide_ids)
(assigned to counts) and then accumulates offsets into index_map using
current_offset, but value_counts has unspecified ordering so ranges can be
wrong; instead iterate the slide_ids sequence in order (or use a sorted copy if
needed) to detect contiguous runs of the same slide id and build index_map by
recording start and end indices for each run (using the same index positions as
in slide_ids) rather than relying on value_counts output; update the code
referencing counts, index_map, current_offset, and slide_ids to produce ranges
from sequential traversal of slide_ids so tiles map to the correct slide ids.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8ccd72f5-22d1-4c73-a73b-e0e5d4eed790
📒 Files selected for processing (3)
docs/learn/get-started/quick-start/tiled-dataset.mdratiopath/ray/aggregate/tensor_mean.pyratiopath/ray/aggregate/tensor_std.py
💤 Files with no reviewable changes (2)
- ratiopath/ray/aggregate/tensor_mean.py
- ratiopath/ray/aggregate/tensor_std.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
docs/learn/get-started/quick-start/tiled-dataset.md (1)
80-80:⚠️ Potential issue | 🟡 MinorAlign the prose with the index-based implementation.
The text still describes per-slide streaming/filtering, but the example now builds a precomputed
slide_id -> rangeindex and usesselect(range)for retrieval. Please update this wording to match the current flow.Also applies to: 86-86
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/learn/get-started/quick-start/tiled-dataset.md` at line 80, Update the prose to reflect that retrieval uses a precomputed index mapping slide_id to a tile range and then fetches with select(range) instead of streaming/filtering per slide; specifically, replace references to "filtering/streaming per-slide" with language like "use the prebuilt slide_id -> range index to look up the tile range and call select(range) to load and cache that view on disk," and make the same change for the second occurrence that parallels this behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 147-157: run_end_encode returns a StructArray with fields "values"
and "run_ends" (not "run_lengths"); to build index ranges compute run lengths as
differences of consecutive run_ends and accumulate offsets rather than reading a
non-existent "run_lengths" field. Update the loop that builds index_map
(referencing run_end_encode and run_ends) to extract run_ends =
run_ends.field("run_ends").to_pylist() and values =
run_ends.field("values").to_pylist(), then iterate paired (value, run_end)
computing start = previous_end (initial 0) and end = run_end and set
index_map[value] = range(start, end), updating previous_end = end each
iteration.
---
Duplicate comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Line 80: Update the prose to reflect that retrieval uses a precomputed index
mapping slide_id to a tile range and then fetches with select(range) instead of
streaming/filtering per slide; specifically, replace references to
"filtering/streaming per-slide" with language like "use the prebuilt slide_id ->
range index to look up the tile range and call select(range) to load and cache
that view on disk," and make the same change for the second occurrence that
parallels this behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1c5b84f8-9bc0-41f6-aacc-dae76c135dcb
📒 Files selected for processing (1)
docs/learn/get-started/quick-start/tiled-dataset.md
There was a problem hiding this comment.
♻️ Duplicate comments (2)
docs/learn/get-started/quick-start/tiled-dataset.md (2)
80-86:⚠️ Potential issue | 🟡 MinorUpdate the explanation to match the new index-based
SlideDatasetflow.This text still describes per-slide dynamic filtering/streaming, but the snippet now uses sorted tiles + precomputed index +
select(range). Please align wording so readers don’t follow an outdated mental model.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 80 - 86, Update the explanation to reflect the index-based SlideDataset flow: replace references to per-slide dynamic filtering/streaming with a description that tiles are pre-sorted, a precomputed index maps slide → tile ranges, and SlideDataset uses select(range) to load those ranges (not repeated streaming/filtering). Mention that this enables efficient chunk access and caching of contiguous ranges on disk rather than per-slide streaming, and reference the SlideDataset class and its use of select(range) and the precomputed index/sorted tile ordering.
149-157:⚠️ Potential issue | 🟠 MajorFix range construction:
run_endsare cumulative ends, not per-run counts.Using
current_offset + countwithrun_endsproduces incorrect ranges after the first group. Build ranges as[previous_end, run_end).Proposed fix
- values = run_ends.field("values") - counts = run_ends.field("run_ends") + values = run_ends.field("values") + run_ends_arr = run_ends.field("run_ends") index_map = {} - current_offset = 0 + previous_end = 0 - for sid, count in zip(values.to_pylist(), counts.to_pylist()): - index_map[sid] = range(current_offset, current_offset + count) - current_offset += count + for sid, end in zip(values.to_pylist(), run_ends_arr.to_pylist()): + index_map[sid] = range(previous_end, end) + previous_end = endIn PyArrow Python, does pyarrow.compute.run_end_encode return cumulative run end offsets (end positions) or per-run lengths?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 149 - 157, The code incorrectly treats `counts` (from `run_ends.field("run_ends")`) as per-run lengths; instead treat them as cumulative run end offsets by replacing `current_offset` logic with a `previous_end` variable initialized to 0 and, inside the loop over `for sid, run_end in zip(values.to_pylist(), counts.to_pylist()):`, set `index_map[sid] = range(previous_end, run_end)` and then `previous_end = run_end`; update references to `current_offset` and `count` accordingly in the `values`/`counts` loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 80-86: Update the explanation to reflect the index-based
SlideDataset flow: replace references to per-slide dynamic filtering/streaming
with a description that tiles are pre-sorted, a precomputed index maps slide →
tile ranges, and SlideDataset uses select(range) to load those ranges (not
repeated streaming/filtering). Mention that this enables efficient chunk access
and caching of contiguous ranges on disk rather than per-slide streaming, and
reference the SlideDataset class and its use of select(range) and the
precomputed index/sorted tile ordering.
- Around line 149-157: The code incorrectly treats `counts` (from
`run_ends.field("run_ends")`) as per-run lengths; instead treat them as
cumulative run end offsets by replacing `current_offset` logic with a
`previous_end` variable initialized to 0 and, inside the loop over `for sid,
run_end in zip(values.to_pylist(), counts.to_pylist()):`, set `index_map[sid] =
range(previous_end, run_end)` and then `previous_end = run_end`; update
references to `current_offset` and `count` accordingly in the `values`/`counts`
loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7734f826-55bb-435a-acac-4d367b97cca4
📒 Files selected for processing (1)
docs/learn/get-started/quick-start/tiled-dataset.md
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
docs/learn/get-started/quick-start/tiled-dataset.md (1)
80-86:⚠️ Potential issue | 🟡 MinorUpdate narrative to match index-based selection flow.
Line 80 and Line 86 still describe per-slide dynamic filtering/streaming, but the example now precomputes
_slide_id_to_indicesand usesselect(range)for lookup. Please align the text with the new flow.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 80 - 86, Update the narrative to describe index-based selection instead of per-slide streaming: explain that SlideDataset precomputes _slide_id_to_indices mapping and then uses dataset.select(range) to fetch tile chunks by index range for a slide, rather than dynamically filtering/streaming the entire tileset; reference _slide_id_to_indices, select(range) and SlideDataset to locate where to reword the paragraphs so they describe precomputation + index lookup and cached select-based retrieval.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 147-150: The code is accessing a RunEndEncodedArray returned by
pyarrow.compute.run_end_encode() using .field("..."), which is invalid; update
the usage so the run_ends variable (result of run_end_encode) uses its
attributes directly: read the encoded values via the .values attribute and the
run-end positions via the .run_ends attribute on the RunEndEncodedArray (i.e.,
replace any run_ends.field("values") and run_ends.field("run_ends") calls with
direct attribute access on the run_ends object).
---
Duplicate comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 80-86: Update the narrative to describe index-based selection
instead of per-slide streaming: explain that SlideDataset precomputes
_slide_id_to_indices mapping and then uses dataset.select(range) to fetch tile
chunks by index range for a slide, rather than dynamically filtering/streaming
the entire tileset; reference _slide_id_to_indices, select(range) and
SlideDataset to locate where to reword the paragraphs so they describe
precomputation + index lookup and cached select-based retrieval.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8f295be1-dd68-47a2-93c1-077748847161
📒 Files selected for processing (1)
docs/learn/get-started/quick-start/tiled-dataset.md
3154b67 to
7e207dc
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request significantly improves the documentation for loading tiled datasets. The reorganization of the tiled-dataset.md guide provides a much clearer and more logical flow for the reader. The new SlideDataset implementation using a pre-built index is a fantastic performance improvement over the previous filtering approach and serves as an excellent best-practice example. The minor cleanups in other docstrings are also good. I've added one suggestion to further optimize the example code for memory efficiency.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
docs/learn/get-started/quick-start/tiled-dataset.md (2)
80-86:⚠️ Potential issue | 🟡 MinorUpdate the explanation to match index-based selection flow.
These paragraphs still describe per-slide dynamic filtering/streaming, but the snippet now precomputes slide ranges and uses
select(range)from the index. Please align the text so readers don’t infer runtime filtering per sample.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 80 - 86, Rewrite the two paragraphs to reflect that selection is index-based and precomputed rather than per-sample streaming: describe that SlideDataset uses precomputed slide ranges from the index and calls select(range) to retrieve chunks for a slide, so the dataset performs index-based range selection and then leverages Hugging Face streaming/caching for the returned view — do not imply dynamic per-sample filtering at runtime. Mention the role of select(range) and the index-derived ranges explicitly and keep the note about smart caching applying to the cached view on disk.
147-150:⚠️ Potential issue | 🔴 CriticalUse the correct
RunEndEncodedArrayaccessors (.values/.run_ends).This snippet uses
.field("values")and.field("run_ends"); forpc.run_end_encode(...), Python API access is via dedicated attributes, and the current form is likely invalid for readers who copy this code.In current PyArrow Python API, what object does pyarrow.compute.run_end_encode(array) return, and what is the correct way to access run ends and values: `.run_ends`/`.values` or `.field("run_ends")`/`.field("values")`?Proposed doc-snippet fix
- run_ends = pc.run_end_encode(slide_ids) - - values = run_ends.field("values") - ends = run_ends.field("run_ends") + ree = pc.run_end_encode(slide_ids) + values = ree.values + ends = ree.run_ends🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 147 - 150, The snippet incorrectly uses .field("values")/.field("run_ends") on the result of pc.run_end_encode; replace those calls by using the RunEndEncodedArray attributes .values and .run_ends (i.e., after calling pc.run_end_encode(slide_ids) treat the result as a RunEndEncodedArray and access .values and .run_ends directly) so readers use the correct Python API for pc.run_end_encode.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 80-86: Rewrite the two paragraphs to reflect that selection is
index-based and precomputed rather than per-sample streaming: describe that
SlideDataset uses precomputed slide ranges from the index and calls
select(range) to retrieve chunks for a slide, so the dataset performs
index-based range selection and then leverages Hugging Face streaming/caching
for the returned view — do not imply dynamic per-sample filtering at runtime.
Mention the role of select(range) and the index-derived ranges explicitly and
keep the note about smart caching applying to the cached view on disk.
- Around line 147-150: The snippet incorrectly uses
.field("values")/.field("run_ends") on the result of pc.run_end_encode; replace
those calls by using the RunEndEncodedArray attributes .values and .run_ends
(i.e., after calling pc.run_end_encode(slide_ids) treat the result as a
RunEndEncodedArray and access .values and .run_ends directly) so readers use the
correct Python API for pc.run_end_encode.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 43fd834e-9858-43ea-932c-5fb1255899ca
📒 Files selected for processing (3)
docs/learn/get-started/quick-start/tiled-dataset.mdratiopath/ray/aggregate/tensor_mean.pyratiopath/ray/aggregate/tensor_std.py
💤 Files with no reviewable changes (2)
- ratiopath/ray/aggregate/tensor_mean.py
- ratiopath/ray/aggregate/tensor_std.py
|
/gemini review |
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
There was a problem hiding this comment.
Pull request overview
This PR updates documentation and examples around tiled dataset loading, and cleans up Ray aggregate function docstring examples.
Changes:
- Reorganized the tiled-dataset quick-start guide into a clearer progression (
TileDataset→ metadata backend →SlideDatasetorchestrator). - Updated the tiled-dataset example to use a persistent Hugging Face dataset instance and an index-based per-slide tile lookup.
- Cleaned up
TensorMean/TensorStddocstring examples by removing empty interactive prompts.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
ratiopath/ray/aggregate/tensor_std.py |
Removes empty doctest prompts in the TensorStd docstring example. |
ratiopath/ray/aggregate/tensor_mean.py |
Removes empty doctest prompts in the TensorMean docstring example. |
docs/learn/get-started/quick-start/tiled-dataset.md |
Reworks the tiled dataset loading narrative and updates the example to use a sorted tiles dataset + precomputed index for per-slide selection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary by CodeRabbit