
Conversation

@JATAYU000
Contributor

Metadata

@codecov-commenter

codecov-commenter commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 57.45721% with 174 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.32%. Comparing base (c5f68bf) to head (96df5e3).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| openml/_api/resources/datasets.py | 31.55% | 128 Missing ⚠️ |
| openml/datasets/functions.py | 6.66% | 14 Missing ⚠️ |
| openml/_api/http/client.py | 82.60% | 12 Missing ⚠️ |
| openml/_api/resources/tasks.py | 87.23% | 6 Missing ⚠️ |
| openml/_api/runtime/fallback.py | 0.00% | 6 Missing ⚠️ |
| openml/_api/runtime/core.py | 81.48% | 5 Missing ⚠️ |
| openml/_api/__init__.py | 75.00% | 1 Missing ⚠️ |
| openml/_api/config.py | 96.87% | 1 Missing ⚠️ |
| openml/tasks/functions.py | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1608      +/-   ##
==========================================
+ Coverage   53.02%   53.32%   +0.29%     
==========================================
  Files          36       46      +10     
  Lines        4326     4645     +319     
==========================================
+ Hits         2294     2477     +183     
- Misses       2032     2168     +136     

☔ View full report in Codecov by Sentry.

@JATAYU000
Contributor Author

JATAYU000 commented Jan 9, 2026

FYI @geetu040 Currently the get_dataset() function has three download requirements:

  • download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket when the download_all_files param is True, and api_calls._download_minio_file() to download the dataset.pq file when it is not found in the cache. When the parquet download fails, it falls back to downloading the dataset.arff file with a GET request.
  • download_features : if feature_file is passed via init it is parsed during initialization; otherwise a GET request is made and the XML is cached.
  • download_qualities : if qualities_file is passed via init it is parsed during initialization; otherwise a GET request is made and the XML is cached.

Issues:

  • The data files (.pq and .arff) are shared across versions, so it does not make sense to download them multiple times.
  • Path handling so that downloads return the path, especially for the data files. As mentioned in the meeting, I can try a download-specific class that uses the cache mixin and is inherited only by the dataset resource.
  • The current implementation in OpenMLDataset has v1-specific parsing, which in my opinion should go through the current interface (api_context).

Example:

The current load_features() (ref link) calls a function that downloads a file and returns its path, and the features are then parsed from that path.
This can be changed by updating that function's definition (ref link) to get -> parse -> return features instead of file paths:

def _get_dataset_features_file(
    did_cache_dir: str | Path | None,
    dataset_id: int,
) -> dict[int, OpenMLDataFeature]:
    ...
    return _features

Or by updating the Dataset class to use the underlying interface method from api_context directly:

def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)

Another option is to add a return_path parameter to client requests, but in my opinion that would be wasteful: it adds a parameter to every client method for the sake of the dataset resource alone, and the need can be handled without it as described above.

@geetu040 geetu040 mentioned this pull request Jan 9, 2026

@geetu040 geetu040 left a comment


Left an intermediate review. This is solid work and well done overall. Nice job. I'll look into the download part now.

Comment on lines +31 to +38
def list(
self,
limit: int,
offset: int,
*,
data_id: list[int] | None = None, # type: ignore
**kwargs: Any,
) -> pd.DataFrame: ...

can we not have the same signature for all 3 methods: DatasetsAPI.list, DatasetsV1.list, DatasetsV2.list? does it raise pre-commit failures since a few might not be used?

Comment on lines +35 to +42
def list(
self,
limit: int,
offset: int,
*,
data_id: list[int] | None = None, # type: ignore
**kwargs: Any,
) -> pd.DataFrame:

you can make this simpler using private helper methods
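One way that could look (a sketch only: the class names are invented, and a plain dict stands in for pd.DataFrame to keep the example self-contained) is a shared public list() that delegates to a per-version private helper:

```python
from __future__ import annotations

from typing import Any


class DatasetsBase:
    """One public list() signature shared by every API version."""

    def list(
        self,
        limit: int,
        offset: int,
        *,
        data_id: list[int] | None = None,
        **kwargs: Any,
    ) -> dict:
        # shared argument handling lives here, once
        filters: dict[str, Any] = {"limit": limit, "offset": offset, **kwargs}
        if data_id is not None:
            filters["data_id"] = ",".join(str(i) for i in data_id)
        return self._list(filters)

    def _list(self, filters: dict) -> dict:
        raise NotImplementedError  # each version overrides only this helper


class DatasetsV1(DatasetsBase):
    def _list(self, filters: dict) -> dict:
        # the real v1 backend would issue the XML request here
        return {"api": "v1", "filters": filters}
```

This also keeps the three public signatures identical, since only the base class defines list().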

bool
True if the deletion was successful. False otherwise.
"""
return openml.utils._delete_entity("data", dataset_id)

if you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.
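For illustration, a resource-local delete might look something like this (a sketch only; `DatasetsResource` is invented, and `DummyClient` stands in for the HTTP client in openml/_api/http/client.py, whose actual interface may differ):

```python
class DummyClient:
    """Stand-in for the real HTTP client; records requests instead of sending."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def delete(self, path: str) -> dict:
        self.calls.append(("DELETE", path))
        return {"status": 200}


class DatasetsResource:
    """Sketch: delete logic owned by the resource, not openml.utils."""

    def __init__(self, client: DummyClient) -> None:
        self._client = client

    def delete(self, dataset_id: int) -> bool:
        response = self._client.delete(f"datasets/{dataset_id}")
        return response.get("status") == 200
```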

Comment on lines +456 to +461
def list(
self,
limit: int,
offset: int,
**kwargs: Any,
) -> pd.DataFrame:

same as above, it can use private helper methods

# Minimalistic check if the XML is useful
if "oml:data_qualities_list" not in qualities:
raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')
from openml._api import api_context

can't we have this import at the very top? does it create a circular import error? if not, it should be moved to the top in all functions.
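For reference, the usual reason such imports sit inside functions is that deferring them breaks a cycle at module load time; a self-contained simulation of the pattern (module names are invented, written to a temp directory purely for demonstration):

```python
import sys
import tempfile
from pathlib import Path


def demo_deferred_import() -> int:
    """mod_a imports mod_b at the top; mod_b imports mod_a only inside a
    function, so mod_a is fully initialised by the time use_a() runs."""
    pkg = Path(tempfile.mkdtemp())
    (pkg / "mod_a.py").write_text("import mod_b\nVALUE = 1\n")
    (pkg / "mod_b.py").write_text(
        "def use_a():\n"
        "    import mod_a  # deferred: avoids the import-time cycle\n"
        "    return mod_a.VALUE\n"
    )
    sys.path.insert(0, str(pkg))
    try:
        import mod_a
        return mod_a.mod_b.use_a()
    finally:
        sys.path.remove(str(pkg))
```

If moving `from openml._api import api_context` to the top of these modules does not raise ImportError, there is no cycle and the question stands.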

