Open
Description
It is unclear how perturbation_stats should handle multiple Subkeys with the same origin (thus the same column name in df).
Currently attempting to group on a duplicated column throws ValueError: Grouper for 'subsample' not 1-dimensional.
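The error can be reproduced outside the library with plain pandas. The snippet below is a minimal sketch, not the actual frame that perturbation_stats builds: the column names other than 'subsample' and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df: two Subkeys share the origin 'subsample',
# so the same column name appears twice.
df = pd.DataFrame(
    [[0, 0, "LR", 0.9], [0, 1, "LR", 0.7], [1, 0, "DT", 0.4]],
    columns=["subsample", "subsample", "model", "out"],
)

try:
    # Selecting 'subsample' yields a 2-column DataFrame, not a Series,
    # so pandas cannot use it as a single grouping key.
    df.groupby("subsample")["out"].mean()
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

Because df["subsample"] returns both columns at once, pandas has no one-dimensional key to group on, which is what the message is complaining about.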
To illustrate the issue, take the exact example pipeline from #35 and use a single subsample Vset with output_matching=False (so that X_trains/X_tests match properly) instead of the two. If we then want to predict with uncertainty over subsamples, it is unclear what grouping by 'subsample' should mean. I see two options:
- My initial thought is that we could implement a way to distinguish identical mismatched Subkeys (maybe by appending -i)
- Alternatively/additionally, we could try to support multidimensional grouping in perturbation_stats
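Both options can be sketched on a toy frame with plain pandas. This is only an illustration under assumptions: dedupe_columns is a hypothetical helper (not part of the library), the data is invented, and "multidimensional grouping" is interpreted here as grouping on all columns that came from the same origin at once.

```python
from collections import Counter

import pandas as pd


def dedupe_columns(df):
    """Hypothetical helper: append '-i' to each repeated column name."""
    totals = Counter(df.columns)
    seen = Counter()
    new_cols = []
    for col in df.columns:
        if totals[col] > 1:
            new_cols.append(f"{col}-{seen[col]}")
            seen[col] += 1
        else:
            new_cols.append(col)
    out = df.copy()
    out.columns = new_cols
    return out


# Invented data: the first 'subsample' column stands for the train-side
# Subkey, the second for the test-side Subkey.
df = pd.DataFrame(
    [[0, 0, "LR", 1.0], [0, 1, "LR", 0.0], [1, 0, "LR", 1.0], [1, 1, "LR", 1.0]],
    columns=["subsample", "subsample", "model", "out"],
)
deduped = dedupe_columns(df)

# Option 1: group on one disambiguated column (here the test-side one).
by_test = deduped.groupby("subsample-1")["out"].mean()

# Option 2: group on every column derived from 'subsample' together.
group_cols = [c for c in deduped.columns if c.startswith("subsample-")]
by_pair = deduped.groupby(group_cols)["out"].mean()
```

Option 1 forces the caller to say which of the two subsample Subkeys is meant; option 2 keeps group_by=['subsample'] as is but produces one group per combination of the two.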
Illustrative Example
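from functools import partial

import numpy as np
import sklearn.datasets
import sklearn.utils
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# vflow import paths below are assumed, not taken from this issue
from vflow import Vset, init_args
from vflow.utils import PREV_KEY

X, y = sklearn.datasets.make_classification(n_samples=100, n_features=5)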
X_train, X_test, y_train, y_test = init_args(train_test_split(X, y), names=['xtr', 'xte', 'ytr', 'yte'])
subsampling_funcs = [partial(sklearn.utils.resample, n_samples=80, random_state=i) for i in range(5)]
subsampling_set = Vset(name='subsample', modules=subsampling_funcs)
X_trains, y_trains = subsampling_set(X_train, y_train)
X_tests, y_tests = subsampling_set(X_test, y_test)
models = [LogisticRegression(max_iter=1000, tol=0.1), DecisionTreeClassifier()]
modeling_set = Vset(name='model', modules=models, module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)
# clamp mean predictions over test-set subsamples
mean_dict, std_dict, pred_stats_df = modeling_set.predict(X_tests, with_uncertainty=True, group_by=['subsample'])
mean_dict = {k: np.round(v) if k != PREV_KEY else v for k, v in mean_dict.items()}