multi-label classification: CV with grouping and following iterative stratification

> **Mirrored from [scikit-multilearn/scikit-multilearn#310](https://github.com/scikit-multilearn/scikit-multilearn/issues/310)**

It would be better to post this on your Slack channel, but I could not log in there (_"It looks like there isn’t an account on scikit-ml tied to this email address. You can always try again with a different email, or find workspaces associated with this email."_)

I was wondering whether anyone has experience training and evaluating models that predict many labels at once while separating train and test folds by a grouping variable (like sklearn's ```GroupKFold```) and, at the same time, preserving the label-pair distribution across folds (via iterative stratification). The grouping ensures that samples that are extremely similar to each other end up in the same fold. For some labels, the dataset can be extremely unbalanced, with positives making up far less than 1% of samples. Grouping should be on Xs and stratification should be on Ys.
 
Do you have any recommendations on how to do it? Should I first group by ```group_id``` and then apply second-order iterative stratification with ```scikit-multilearn``` on the group level? This does not seem to be an easy problem...
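To make the "group first, then stratify at the group level" idea concrete, here is a minimal sketch in plain NumPy. It is not scikit-multilearn's iterative stratification: it is a simple greedy heuristic that only balances per-label (first-order) counts and fold sizes, and the function name `assign_group_folds` is hypothetical. It assumes `Y` is a binary indicator matrix and `groups` holds one group id per sample.

```python
import numpy as np


def assign_group_folds(Y, groups, n_splits=5):
    """Greedy sketch: give each whole group to the fold that most lacks its labels.

    Y      : (n_samples, n_labels) binary indicator matrix
    groups : (n_samples,) group id per sample
    Returns an (n_samples,) array of fold indices in [0, n_splits).
    """
    Y = np.asarray(Y)
    uniq, inv = np.unique(np.asarray(groups), return_inverse=True)
    inv = inv.ravel()
    n_groups = len(uniq)

    # Per-group statistics: positives per label, plus group size as a last column.
    label_counts = np.zeros((n_groups, Y.shape[1]))
    np.add.at(label_counts, inv, Y)
    sizes = np.bincount(inv, minlength=n_groups)
    stats = np.column_stack([label_counts, sizes])

    # Express each group as the share it holds of every column total.
    totals = np.maximum(stats.sum(axis=0), 1.0)  # guard against labels with no positives
    share = stats / totals

    # Place groups holding a large share of some (typically rare) label first.
    order = np.argsort(-share.max(axis=1), kind="stable")

    fold_share = np.zeros((n_splits, stats.shape[1]))
    group_fold = np.empty(n_groups, dtype=int)
    for g in order:
        # How far each fold is below its target share (1 / n_splits) per column.
        deficit = 1.0 / n_splits - fold_share
        k = int(np.argmax(deficit @ share[g]))
        group_fold[g] = k
        fold_share[k] += share[g]

    # Every sample inherits the fold of its group, so no group is ever split.
    return group_fold[inv]
```

The returned array can be turned into train/test indices with `np.where(fold == k)` for each fold `k`. Preserving label-pair (second-order) distributions would require extending the per-group statistics with pair counts, which this sketch does not attempt.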

I see this issue, possibly related, but I do not understand what it refers to: https://github.com/scikit-multilearn/scikit-multilearn/issues/151
