Skip to content

Feature/update docs#45

Merged
JakubPekar merged 6 commits into
mainfrom
feature/update-docs
Mar 26, 2026
Merged

Feature/update docs#45
JakubPekar merged 6 commits into
mainfrom
feature/update-docs

Conversation

@JakubPekar
Copy link
Copy Markdown
Collaborator

@JakubPekar JakubPekar commented Mar 19, 2026

Summary by CodeRabbit

  • Documentation
    • Reorganized the tiled-dataset guide to a clearer flow covering core tile dataset, metadata management, and a slide-level orchestrator.
    • Clarified how tile-level data is referenced and accessed per slide via an index-based lookup.
    • Updated examples to use persistent dataset instances and explicit per-slide selection.
    • Cleaned up example docstrings by removing extraneous output placeholders.

@JakubPekar JakubPekar requested review from a team and Copilot March 19, 2026 09:44
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the documentation for constructing and efficiently loading tiled datasets, particularly in large-scale computer vision applications like digital pathology. It refactors the explanation of data loading pipelines, clarifies the roles of different dataset components, and introduces a performance optimization for filtering tile metadata, ensuring that the system can handle massive datasets without memory issues.

Highlights

  • Documentation Restructuring: The documentation for building tiled datasets has been significantly reorganized to improve clarity and flow, reordering sections and renaming key components like 'The Main Torch Dataset' to 'The Orchestrator: SlideDataset'.
  • Optimized Tile Filtering: Implemented a more efficient method for filtering tiles within the SlideDataset by pre-sorting the tiles_dataset by slide_id and creating a fast lookup index using pyarrow.compute.value_counts. This replaces the less performant filter lambda with direct index selection.
  • Enhanced Explanation of Hugging Face datasets: The role and efficiency of the Hugging Face datasets library for managing large tabular metadata are now explained in greater detail, emphasizing its lazy loading, zero RAM overhead, and O(1) random access capabilities.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Reorganized the tiled-dataset quick-start and refactored the SlideDataset example to persist HF slide/tile datasets, precompute a tile index using pyarrow, add a slice-based filter_tiles_by_slide, and update imports; also removed stray output placeholders from tensor aggregation docstrings.

Changes

Cohort / File(s) Summary
Tiled-dataset tutorial
docs/learn/get-started/quick-start/tiled-dataset.md
Reordered sections to introduce TileDataset, HF datasets metadata, and SlideDataset orchestrator. SlideDataset example now assigns self.slides_dataset and self.tiles_dataset (tiles sorted by slide_id), adds @staticmethod _build_tile_index(tiles: HFDataset) -> dict[str, range] using pyarrow.compute.run_end_encode, stores self._slide_id_to_indices, and exposes filter_tiles_by_slide(self, slide_id: str) -> HFDataset that returns self.tiles_dataset.select(tile_range). Updated imports (import pyarrow.compute as pc, Dataset as HFDataset) and final usage text.
Docstring examples
ratiopath/ray/aggregate/tensor_mean.py, ratiopath/ray/aggregate/tensor_std.py
Removed stray/empty output placeholder lines (>>>) from example docstrings; no functional or API changes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • feat: dataset tutorial #37: Modifies the same tiled-dataset tutorial and SlideDataset example with a similar refactor (pre-sorted tiles, _build_tile_index, and filter_tiles_by_slide).

Suggested reviewers

  • vejtek

Poem

🐇 I hopped through docs with nimble paws,
Rearranged tiles and tuned the laws.
An index baked with pyarrow's art,
Slices ready, neat to start.
Hooray — the tutorial hums with heart.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Feature/update docs' is overly vague and generic; it lacks specificity about which documentation was updated or what the main focus of the changes are. Consider a more specific title such as 'Refactor TileDataset documentation with improved indexing strategy' or 'Update tiled-dataset.md documentation and fix docstring examples' to better convey the primary changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/update-docs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly improves the documentation for loading tiled datasets by introducing a much more efficient data loading strategy. The new approach, which involves pre-sorting the tile metadata and building an index, is a great enhancement. However, I found a critical bug in the implementation of the index-building logic (_build_tile_index) in the example code. The use of pc.value_counts is incorrect as it doesn't guarantee order, which can lead to silent data corruption. I've provided a corrected implementation using pc.run_end_encode. The other changes are minor docstring cleanups which look good.

Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates documentation/examples around tiled dataset loading and cleans up Ray tensor aggregator docstrings.

Changes:

  • Refactors tiled-dataset.md to better explain TileDataset/SlideDataset and introduces an index-based tile lookup example.
  • Adds a pyarrow.compute-based _build_tile_index example to avoid per-slide filter(...) lambdas.
  • Removes stray blank >>> lines from TensorMean/TensorStd docstring examples.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
ratiopath/ray/aggregate/tensor_std.py Docstring doctest formatting cleanup (removes empty prompts).
ratiopath/ray/aggregate/tensor_mean.py Docstring doctest formatting cleanup (removes empty prompts).
docs/learn/get-started/quick-start/tiled-dataset.md Reworks quick-start narrative and updates example code to use a precomputed tile index for per-slide selection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Line 80: The docs still describe per-slide streaming/filtering but the
implementation now uses SlideDataset with a precomputed index, sort, and
index-based selection; update the wording (the sentences that mention Smart
Caching / per-slide filtering and dynamic streaming) to explain that
SlideDataset precomputes an index, sorts tiles, and then you create a view with
select(range) (or select by index range) which is cached to disk, rather than
streaming and filtering per-slide at read time; mention SlideDataset, sort,
precomputed index, and select(range) explicitly so readers know how the
cache/view is produced.
- Around line 146-157: The current logic uses pc.value_counts(slide_ids)
(assigned to counts) and then accumulates offsets into index_map using
current_offset, but value_counts has unspecified ordering so ranges can be
wrong; instead iterate the slide_ids sequence in order (or use a sorted copy if
needed) to detect contiguous runs of the same slide id and build index_map by
recording start and end indices for each run (using the same index positions as
in slide_ids) rather than relying on value_counts output; update the code
referencing counts, index_map, current_offset, and slide_ids to produce ranges
from sequential traversal of slide_ids so tiles map to the correct slide ids.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ccd72f5-22d1-4c73-a73b-e0e5d4eed790

📥 Commits

Reviewing files that changed from the base of the PR and between 3f3ef2d and 9ec9e10.

📒 Files selected for processing (3)
  • docs/learn/get-started/quick-start/tiled-dataset.md
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py
💤 Files with no reviewable changes (2)
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py

Comment thread docs/learn/get-started/quick-start/tiled-dataset.md
Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/learn/get-started/quick-start/tiled-dataset.md (1)

80-80: ⚠️ Potential issue | 🟡 Minor

Align the prose with the index-based implementation.

The text still describes per-slide streaming/filtering, but the example now builds a precomputed slide_id -> range index and uses select(range) for retrieval. Please update this wording to match the current flow.

Also applies to: 86-86

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/learn/get-started/quick-start/tiled-dataset.md` at line 80, Update the
prose to reflect that retrieval uses a precomputed index mapping slide_id to a
tile range and then fetches with select(range) instead of streaming/filtering
per slide; specifically, replace references to "filtering/streaming per-slide"
with language like "use the prebuilt slide_id -> range index to look up the tile
range and call select(range) to load and cache that view on disk," and make the
same change for the second occurrence that parallels this behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 147-157: run_end_encode returns a StructArray with fields "values"
and "run_ends" (not "run_lengths"); to build index ranges compute run lengths as
differences of consecutive run_ends and accumulate offsets rather than reading a
non-existent "run_lengths" field. Update the loop that builds index_map
(referencing run_end_encode and run_ends) to extract run_ends =
run_ends.field("run_ends").to_pylist() and values =
run_ends.field("values").to_pylist(), then iterate paired (value, run_end)
computing start = previous_end (initial 0) and end = run_end and set
index_map[value] = range(start, end), updating previous_end = end each
iteration.

---

Duplicate comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Line 80: Update the prose to reflect that retrieval uses a precomputed index
mapping slide_id to a tile range and then fetches with select(range) instead of
streaming/filtering per slide; specifically, replace references to
"filtering/streaming per-slide" with language like "use the prebuilt slide_id ->
range index to look up the tile range and call select(range) to load and cache
that view on disk," and make the same change for the second occurrence that
parallels this behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c5b84f8-9bc0-41f6-aacc-dae76c135dcb

📥 Commits

Reviewing files that changed from the base of the PR and between 9ec9e10 and e8685b5.

📒 Files selected for processing (1)
  • docs/learn/get-started/quick-start/tiled-dataset.md

Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Adames4
Adames4 previously approved these changes Mar 19, 2026
Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
docs/learn/get-started/quick-start/tiled-dataset.md (2)

80-86: ⚠️ Potential issue | 🟡 Minor

Update the explanation to match the new index-based SlideDataset flow.

This text still describes per-slide dynamic filtering/streaming, but the snippet now uses sorted tiles + precomputed index + select(range). Please align wording so readers don’t follow an outdated mental model.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 80 - 86,
Update the explanation to reflect the index-based SlideDataset flow: replace
references to per-slide dynamic filtering/streaming with a description that
tiles are pre-sorted, a precomputed index maps slide → tile ranges, and
SlideDataset uses select(range) to load those ranges (not repeated
streaming/filtering). Mention that this enables efficient chunk access and
caching of contiguous ranges on disk rather than per-slide streaming, and
reference the SlideDataset class and its use of select(range) and the
precomputed index/sorted tile ordering.

149-157: ⚠️ Potential issue | 🟠 Major

Fix range construction: run_ends are cumulative ends, not per-run counts.

Using current_offset + count with run_ends produces incorrect ranges after the first group. Build ranges as [previous_end, run_end).

Proposed fix
-        values = run_ends.field("values")
-        counts = run_ends.field("run_ends")
+        values = run_ends.field("values")
+        run_ends_arr = run_ends.field("run_ends")

         index_map = {}
-        current_offset = 0
+        previous_end = 0

-        for sid, count in zip(values.to_pylist(), counts.to_pylist()):
-            index_map[sid] = range(current_offset, current_offset + count)
-            current_offset += count
+        for sid, end in zip(values.to_pylist(), run_ends_arr.to_pylist()):
+            index_map[sid] = range(previous_end, end)
+            previous_end = end
In PyArrow Python, does pyarrow.compute.run_end_encode return cumulative run end offsets (end positions) or per-run lengths?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 149 - 157,
The code incorrectly treats `counts` (from `run_ends.field("run_ends")`) as
per-run lengths; instead treat them as cumulative run end offsets by replacing
`current_offset` logic with a `previous_end` variable initialized to 0 and,
inside the loop over `for sid, run_end in zip(values.to_pylist(),
counts.to_pylist()):`, set `index_map[sid] = range(previous_end, run_end)` and
then `previous_end = run_end`; update references to `current_offset` and `count`
accordingly in the `values`/`counts` loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 80-86: Update the explanation to reflect the index-based
SlideDataset flow: replace references to per-slide dynamic filtering/streaming
with a description that tiles are pre-sorted, a precomputed index maps slide →
tile ranges, and SlideDataset uses select(range) to load those ranges (not
repeated streaming/filtering). Mention that this enables efficient chunk access
and caching of contiguous ranges on disk rather than per-slide streaming, and
reference the SlideDataset class and its use of select(range) and the
precomputed index/sorted tile ordering.
- Around line 149-157: The code incorrectly treats `counts` (from
`run_ends.field("run_ends")`) as per-run lengths; instead treat them as
cumulative run end offsets by replacing `current_offset` logic with a
`previous_end` variable initialized to 0 and, inside the loop over `for sid,
run_end in zip(values.to_pylist(), counts.to_pylist()):`, set `index_map[sid] =
range(previous_end, run_end)` and then `previous_end = run_end`; update
references to `current_offset` and `count` accordingly in the `values`/`counts`
loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7734f826-55bb-435a-acac-4d367b97cca4

📥 Commits

Reviewing files that changed from the base of the PR and between e8685b5 and f0898f4.

📒 Files selected for processing (1)
  • docs/learn/get-started/quick-start/tiled-dataset.md

Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/learn/get-started/quick-start/tiled-dataset.md (1)

80-86: ⚠️ Potential issue | 🟡 Minor

Update narrative to match index-based selection flow.

Line 80 and Line 86 still describe per-slide dynamic filtering/streaming, but the example now precomputes _slide_id_to_indices and uses select(range) for lookup. Please align the text with the new flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 80 - 86,
Update the narrative to describe index-based selection instead of per-slide
streaming: explain that SlideDataset precomputes _slide_id_to_indices mapping
and then uses dataset.select(range) to fetch tile chunks by index range for a
slide, rather than dynamically filtering/streaming the entire tileset; reference
_slide_id_to_indices, select(range) and SlideDataset to locate where to reword
the paragraphs so they describe precomputation + index lookup and cached
select-based retrieval.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 147-150: The code is accessing a RunEndEncodedArray returned by
pyarrow.compute.run_end_encode() using .field("..."), which is invalid; update
the usage so the run_ends variable (result of run_end_encode) uses its
attributes directly: read the encoded values via the .values attribute and the
run-end positions via the .run_ends attribute on the RunEndEncodedArray (i.e.,
replace any run_ends.field("values") and run_ends.field("run_ends") calls with
direct attribute access on the run_ends object).

---

Duplicate comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 80-86: Update the narrative to describe index-based selection
instead of per-slide streaming: explain that SlideDataset precomputes
_slide_id_to_indices mapping and then uses dataset.select(range) to fetch tile
chunks by index range for a slide, rather than dynamically filtering/streaming
the entire tileset; reference _slide_id_to_indices, select(range) and
SlideDataset to locate where to reword the paragraphs so they describe
precomputation + index lookup and cached select-based retrieval.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f295be1-dd68-47a2-93c1-077748847161

📥 Commits

Reviewing files that changed from the base of the PR and between f0898f4 and 3154b67.

📒 Files selected for processing (1)
  • docs/learn/get-started/quick-start/tiled-dataset.md

Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Adames4
Adames4 previously approved these changes Mar 20, 2026
matejpekar
matejpekar previously approved these changes Mar 21, 2026
@matejpekar matejpekar force-pushed the feature/update-docs branch from 3154b67 to 7e207dc Compare March 21, 2026 20:03
@matejpekar
Copy link
Copy Markdown
Member

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly improves the documentation for loading tiled datasets. The reorganization of the tiled-dataset.md guide provides a much clearer and more logical flow for the reader. The new SlideDataset implementation using a pre-built index is a fantastic performance improvement over the previous filtering approach and serves as an excellent best-practice example. The minor cleanups in other docstrings are also good. I've added one suggestion to further optimize the example code for memory efficiency.

Comment thread docs/learn/get-started/quick-start/tiled-dataset.md Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
docs/learn/get-started/quick-start/tiled-dataset.md (2)

80-86: ⚠️ Potential issue | 🟡 Minor

Update the explanation to match index-based selection flow.

These paragraphs still describe per-slide dynamic filtering/streaming, but the snippet now precomputes slide ranges and uses select(range) from the index. Please align the text so readers don’t infer runtime filtering per sample.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 80 - 86,
Rewrite the two paragraphs to reflect that selection is index-based and
precomputed rather than per-sample streaming: describe that SlideDataset uses
precomputed slide ranges from the index and calls select(range) to retrieve
chunks for a slide, so the dataset performs index-based range selection and then
leverages Hugging Face streaming/caching for the returned view — do not imply
dynamic per-sample filtering at runtime. Mention the role of select(range) and
the index-derived ranges explicitly and keep the note about smart caching
applying to the cached view on disk.

147-150: ⚠️ Potential issue | 🔴 Critical

Use the correct RunEndEncodedArray accessors (.values / .run_ends).

This snippet uses .field("values") and .field("run_ends"); for pc.run_end_encode(...), Python API access is via dedicated attributes, and the current form is likely invalid for readers who copy this code.

In current PyArrow Python API, what object does pyarrow.compute.run_end_encode(array) return, and what is the correct way to access run ends and values: `.run_ends`/`.values` or `.field("run_ends")`/`.field("values")`?
Proposed doc-snippet fix
-        run_ends = pc.run_end_encode(slide_ids)
-
-        values = run_ends.field("values")
-        ends = run_ends.field("run_ends")
+        ree = pc.run_end_encode(slide_ids)
+        values = ree.values
+        ends = ree.run_ends
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/learn/get-started/quick-start/tiled-dataset.md` around lines 147 - 150,
The snippet incorrectly uses .field("values")/.field("run_ends") on the result
of pc.run_end_encode; replace those calls by using the RunEndEncodedArray
attributes .values and .run_ends (i.e., after calling
pc.run_end_encode(slide_ids) treat the result as a RunEndEncodedArray and access
.values and .run_ends directly) so readers use the correct Python API for
pc.run_end_encode.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/learn/get-started/quick-start/tiled-dataset.md`:
- Around line 80-86: Rewrite the two paragraphs to reflect that selection is
index-based and precomputed rather than per-sample streaming: describe that
SlideDataset uses precomputed slide ranges from the index and calls
select(range) to retrieve chunks for a slide, so the dataset performs
index-based range selection and then leverages Hugging Face streaming/caching
for the returned view — do not imply dynamic per-sample filtering at runtime.
Mention the role of select(range) and the index-derived ranges explicitly and
keep the note about smart caching applying to the cached view on disk.
- Around line 147-150: The snippet incorrectly uses
.field("values")/.field("run_ends") on the result of pc.run_end_encode; replace
those calls by using the RunEndEncodedArray attributes .values and .run_ends
(i.e., after calling pc.run_end_encode(slide_ids) treat the result as a
RunEndEncodedArray and access .values and .run_ends directly) so readers use the
correct Python API for pc.run_end_encode.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 43fd834e-9858-43ea-932c-5fb1255899ca

📥 Commits

Reviewing files that changed from the base of the PR and between 3154b67 and 7e207dc.

📒 Files selected for processing (3)
  • docs/learn/get-started/quick-start/tiled-dataset.md
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py
💤 Files with no reviewable changes (2)
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py

@JakubPekar JakubPekar dismissed stale reviews from Adames4 and matejpekar via 15c7573 March 24, 2026 20:59
@JakubPekar
Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist
Copy link
Copy Markdown

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates documentation and examples around tiled dataset loading, and cleans up Ray aggregate function docstring examples.

Changes:

  • Reorganized the tiled-dataset quick-start guide into a clearer progression (TileDataset → metadata backend → SlideDataset orchestrator).
  • Updated the tiled-dataset example to use a persistent Hugging Face dataset instance and an index-based per-slide tile lookup.
  • Cleaned up TensorMean/TensorStd docstring examples by removing empty interactive prompts.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
ratiopath/ray/aggregate/tensor_std.py Removes empty doctest prompts in the TensorStd docstring example.
ratiopath/ray/aggregate/tensor_mean.py Removes empty doctest prompts in the TensorMean docstring example.
docs/learn/get-started/quick-start/tiled-dataset.md Reworks the tiled dataset loading narrative and updates the example to use a sorted tiles dataset + precomputed index for per-slide selection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/learn/get-started/quick-start/tiled-dataset.md
Comment thread docs/learn/get-started/quick-start/tiled-dataset.md
Comment thread docs/learn/get-started/quick-start/tiled-dataset.md
@JakubPekar JakubPekar merged commit b0062e0 into main Mar 26, 2026
10 checks passed
@JakubPekar JakubPekar deleted the feature/update-docs branch March 26, 2026 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants