
Conversation

@JATAYU000
Contributor

Metadata

@codecov-commenter

codecov-commenter commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 57.45721% with 174 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.32%. Comparing base (c5f68bf) to head (96df5e3).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| openml/_api/resources/datasets.py | 31.55% | 128 Missing ⚠️ |
| openml/datasets/functions.py | 6.66% | 14 Missing ⚠️ |
| openml/_api/http/client.py | 82.60% | 12 Missing ⚠️ |
| openml/_api/resources/tasks.py | 87.23% | 6 Missing ⚠️ |
| openml/_api/runtime/fallback.py | 0.00% | 6 Missing ⚠️ |
| openml/_api/runtime/core.py | 81.48% | 5 Missing ⚠️ |
| openml/_api/__init__.py | 75.00% | 1 Missing ⚠️ |
| openml/_api/config.py | 96.87% | 1 Missing ⚠️ |
| openml/tasks/functions.py | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1608      +/-   ##
==========================================
+ Coverage   53.02%   53.32%   +0.29%     
==========================================
  Files          36       46      +10     
  Lines        4326     4645     +319     
==========================================
+ Hits         2294     2477     +183     
- Misses       2032     2168     +136     

☔ View full report in Codecov by Sentry.

@JATAYU000
Contributor Author

JATAYU000 commented Jan 9, 2026

FYI @geetu040 Currently the get_dataset() function has three download requirements:

  • download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket when the download_all_files param is True, and api_calls._download_minio_file() to download the dataset.pq file when it is not found in the cache. When the parquet download fails, it falls back to downloading the dataset.arff file with a GET request.
  • download_features : if feature_file is passed via init it is parsed during initialization; otherwise a GET request is made and the XML is cached.
  • download_qualities : if qualities_file is passed via init it is parsed during initialization; otherwise a GET request is made and the XML is cached.

Issues:

  • The data files (.pq and .arff) are shared across versions, so it does not make sense to download them multiple times.
  • Path handling so that downloads return the path, especially for the data files. As mentioned in the meeting, I can try a download-specific class that uses the cache mixin and is inherited only by the dataset resource.
  • The current implementation in OpenMLDataset has v1-specific parsing, which in my opinion should go through the current interface (api_context).

Example:

The current load_features() (ref link) calls a function that downloads a file and returns its path, and the features are then parsed from that path.
This can be changed by updating that function's definition (ref link) to get -> parse -> return features instead of file paths:

def _get_dataset_features_file(
    did_cache_dir: str | Path | None,
    dataset_id: int,
) -> dict[int, OpenMLDataFeature]:
    ...
    return _features

Or by updating the Dataset class to use the underlying interface method from api_context directly:

def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)

Another option is to add a return_path parameter to client requests, but in my opinion that would be wasteful: it adds a parameter to every client method for the sake of the dataset resource alone, and the need can be handled without it as described above.

@geetu040 geetu040 mentioned this pull request Jan 9, 2026

@geetu040 geetu040 left a comment


Left an intermediate review. This is solid work and well done overall. Nice job. I'll look into the download part now.

Comment on lines +31 to +38
def list(
self,
limit: int,
offset: int,
*,
data_id: list[int] | None = None, # type: ignore
**kwargs: Any,
) -> pd.DataFrame: ...

can we not have the same signature for all 3 methods: DatasetsAPI.list, DatasetsV1.list, DatasetsV2.list? does it raise pre-commit failures since a few might not be used?

Comment on lines +35 to +42
def list(
self,
limit: int,
offset: int,
*,
data_id: list[int] | None = None, # type: ignore
**kwargs: Any,
) -> pd.DataFrame:

you can make this simpler using private helper methods
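One way that could look (a sketch only: the class names are invented, and a plain dict stands in for pd.DataFrame to keep the example self-contained) is a shared public list() that delegates to a per-version private helper:

```python
from __future__ import annotations

from typing import Any


class DatasetsBase:
    """One public list() signature shared by every API version."""

    def list(
        self,
        limit: int,
        offset: int,
        *,
        data_id: list[int] | None = None,
        **kwargs: Any,
    ) -> dict:
        # shared argument handling lives here, once
        filters: dict[str, Any] = {"limit": limit, "offset": offset, **kwargs}
        if data_id is not None:
            filters["data_id"] = ",".join(str(i) for i in data_id)
        return self._list(filters)

    def _list(self, filters: dict) -> dict:
        raise NotImplementedError  # each version overrides only this helper


class DatasetsV1(DatasetsBase):
    def _list(self, filters: dict) -> dict:
        # the real v1 backend would issue the XML request here
        return {"api": "v1", "filters": filters}
```

This also keeps the three public signatures identical, since only the base class defines list().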

bool
True if the deletion was successful. False otherwise.
"""
return openml.utils._delete_entity("data", dataset_id)

if you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.
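For illustration, a resource-local delete might look something like this (a sketch only; `DatasetsResource` is invented, and `DummyClient` stands in for the HTTP client in openml/_api/http/client.py, whose actual interface may differ):

```python
class DummyClient:
    """Stand-in for the real HTTP client; records requests instead of sending."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def delete(self, path: str) -> dict:
        self.calls.append(("DELETE", path))
        return {"status": 200}


class DatasetsResource:
    """Sketch: delete logic owned by the resource, not openml.utils."""

    def __init__(self, client: DummyClient) -> None:
        self._client = client

    def delete(self, dataset_id: int) -> bool:
        response = self._client.delete(f"datasets/{dataset_id}")
        return response.get("status") == 200
```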

Comment on lines +456 to +461
def list(
self,
limit: int,
offset: int,
**kwargs: Any,
) -> pd.DataFrame:

same as above, it can use private helper methods

# Minimalistic check if the XML is useful
if "oml:data_qualities_list" not in qualities:
raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')
from openml._api import api_context

can't we have this import at the very top? does it create a circular import error? if not, it should be moved to the top in all functions.
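For reference, the usual reason such imports sit inside functions is that deferring them breaks a cycle at module load time; a self-contained simulation of the pattern (module names are invented, written to a temp directory purely for demonstration):

```python
import sys
import tempfile
from pathlib import Path


def demo_deferred_import() -> int:
    """mod_a imports mod_b at the top; mod_b imports mod_a only inside a
    function, so mod_a is fully initialised by the time use_a() runs."""
    pkg = Path(tempfile.mkdtemp())
    (pkg / "mod_a.py").write_text("import mod_b\nVALUE = 1\n")
    (pkg / "mod_b.py").write_text(
        "def use_a():\n"
        "    import mod_a  # deferred: avoids the import-time cycle\n"
        "    return mod_a.VALUE\n"
    )
    sys.path.insert(0, str(pkg))
    try:
        import mod_a
        return mod_a.mod_b.use_a()
    finally:
        sys.path.remove(str(pkg))
```

If moving `from openml._api import api_context` to the top of these modules does not raise ImportError, there is no cycle and the question stands.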

