Observed behavior
Hi, there is a bug in classification-and-pca-lab.ipynb for Lab 6, in the do_classify and classify_from_dataframe methods. When the testing data is standardized, the testing data's own mean and standard deviation are used. This is incorrect for at least two reasons:
- No information from the testing data should be used when fitting the model or its preprocessing. Doing so is a form of data snooping, and it contaminates the testing set.
- The training and testing sets are transformed with different means and standard deviations, so the same raw value maps to different standardized values and the two sets no longer contain the same variable.
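To make the second point concrete, here is a minimal sketch using only the standard library. The toy values and the `standardize` helper are hypothetical and are not taken from the lab notebook. The test values sit far above the training values, yet scaling them with their own statistics makes them look identical to the scaled training data:

```python
from statistics import mean, stdev

# Hypothetical toy values for a single feature; not the lab's data.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [11.0, 12.0, 13.0, 14.0, 15.0]


def standardize(values, center, scale):
    """Return (v - center) / scale for every v in values."""
    return [(v - center) / scale for v in values]


train_scaled = standardize(train, mean(train), stdev(train))

# Reported behavior: the test set is scaled with its own mean and std.
test_own_stats = standardize(test, mean(test), stdev(test))

# Proposed behavior: the test set is scaled with the training mean and std.
test_train_stats = standardize(test, mean(train), stdev(train))

print("train, train stats:", train_scaled)
print("test, own stats:   ", test_own_stats)
print("test, train stats: ", test_train_stats)
```

With the test set's own statistics, the shift between the two sets disappears. With the training statistics, every test value stays above the largest scaled training value, which is what a model fitted on the training data should see.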
Expected behavior
The training data mean and standard deviation should be used for standardizing the testing data like so:

    dftest=(subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
    Xte = (subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()

I think this was mentioned in one of the earlier lectures, and here are some more references:
- https://stats.stackexchange.com/questions/202287/why-standardization-of-the-testing-set-has-to-be-performed-with-the-mean-and-sd
- https://sebastianraschka.com/faq/docs/scale-training-test.html
- https://www.researchgate.net/post/If_I_used_data_normalization_x-meanx_stdx_for_training_data_would_I_use_train_Mean_and_Standard_Deviation_to_normalize_test_data
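
As a rough sketch of the expected behavior, the training statistics can be computed once and applied to both splits. The helper name `standardize_with_train_stats`, the toy frame, and the index lists below are assumptions for illustration. Only `subdf`, `itrain`, `itest`, and `dftest` mirror names from the snippet above, and this is not the notebook's actual code:

```python
import pandas as pd


def standardize_with_train_stats(df, itrain, itest):
    """Standardize train and test rows using training-row statistics only."""
    train = df.iloc[itrain]
    test = df.iloc[itest]
    mu = train.mean()    # per-column mean of the training rows
    sigma = train.std()  # per-column sample std (pandas default, ddof=1)
    return (train - mu) / sigma, (test - mu) / sigma


# Hypothetical toy frame and positional split, for illustration only.
subdf = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
itrain, itest = [0, 1, 2, 3], [4, 5]

dftrain, dftest = standardize_with_train_stats(subdf, itrain, itest)
print(dftrain)
print(dftest)
```

The scaled training rows have mean 0 and standard deviation 1 per column by construction. The scaled test rows generally do not, and that is expected.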