multi-label classification: CV with grouping followed by iterative stratification #78

@necrosource-bot

Mirrored from scikit-multilearn/scikit-multilearn#310

This question would probably fit better on your Slack channel, but I could not log in there. The error was: "It looks like there isn’t an account on scikit-ml tied to this email address. You can always try again with a different email, or find workspaces associated with this email."

I was wondering whether anyone has experience training and evaluating multi-label models (many labels predicted at once) when the train and test folds must be separated by a grouping variable (like sklearn's GroupKFold) and must also preserve the label-pair distribution across folds (via iterative stratification). The grouping ensures that samples that are extremely similar to each other end up in the same fold. Some labels can be extremely unbalanced, with positives making up far less than 1% of the samples. The grouping should be based on the Xs and the stratification on the Ys.

Do you have any recommendations on how to do it? Should I first group by group_id and then apply second-order iterative stratification with scikit-multilearn on the group level? This does not seem to be an easy problem...
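To make the group-level idea concrete, here is a minimal sketch in plain Python. It is not scikit-multilearn's implementation: it is a simplified, first-order greedy variant of iterative stratification in which whole groups, rather than single samples, are handed to folds, starting with the rarest label. The function name `grouped_iterative_folds` and the toy data are hypothetical, and the second-order (label-pair) case is not covered.

```python
from collections import defaultdict


def grouped_iterative_folds(groups, Y, n_splits=5):
    """Return one fold index per sample; every group stays in a single fold."""
    n_labels = len(Y[0])
    size = defaultdict(int)                    # samples per group
    pos = defaultdict(lambda: [0] * n_labels)  # positives per label, per group
    for g, row in zip(groups, Y):
        size[g] += 1
        for j, v in enumerate(row):
            pos[g][j] += int(v)

    # Remaining "budget" per fold: samples overall and positives per label.
    want_n = [len(groups) / n_splits] * n_splits
    totals = [sum(pos[g][j] for g in size) for j in range(n_labels)]
    want_pos = [[t / n_splits for t in totals] for _ in range(n_splits)]

    fold_of = {}
    remaining = list(size)  # unassigned groups, in first-seen order
    while remaining:
        left = [sum(pos[g][j] for g in remaining) for j in range(n_labels)]
        live = [j for j in range(n_labels) if left[j] > 0]
        if live:
            # Rarest remaining label first; groups with most positives first.
            j = min(live, key=lambda k: left[k])
            batch = sorted((g for g in remaining if pos[g][j] > 0),
                           key=lambda g: -pos[g][j])
            need = lambda f: (want_pos[f][j], want_n[f])
        else:
            # Only all-negative groups remain: just balance fold sizes.
            batch = sorted(remaining, key=lambda g: -size[g])
            need = lambda f: want_n[f]
        for g in batch:
            # Give the group to the fold that still needs it most.
            f = max(range(n_splits), key=need)
            fold_of[g] = f
            want_n[f] -= size[g]
            for k in range(n_labels):
                want_pos[f][k] -= pos[g][k]
        done = set(batch)
        remaining = [g for g in remaining if g not in done]
    return [fold_of[g] for g in groups]


# Toy data: 10 groups of 3 samples, one rare label and one common label.
groups = [i // 3 for i in range(30)]
Y = [[int(i % 10 == 0), i % 2] for i in range(30)]
folds = grouped_iterative_folds(groups, Y, n_splits=3)
```

Because groups are indivisible, the per-fold label counts can only be approximately balanced; a large group that carries most of a rare label's positives will dominate whichever fold receives it.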

I found this possibly related issue, but I do not understand what it refers to: scikit-multilearn#151
