Standardization of test data in Lab 6 should use training mean and standard deviation #11

@covuworie

Observed behavior

Hi, there are bugs in classification-and-pca-lab.ipynb for Lab 6, in the do_classify and classify_from_dataframe methods. When the testing data is standardized, its own mean and standard deviation are used. This is incorrect for several reasons, such as:

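  • No information from the testing data should be used when producing the model's predictions, since that is a form of data snooping. Computing the mean and standard deviation from the testing set contaminates it.
  • The training and testing sets are transformed with different parameters, so the same variable is not being created in both: identical raw values map to different standardized values in each set.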
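
To illustrate the second point, here is a small sketch with made-up numbers (not taken from the lab's dataset; the `zscore` helper is hypothetical). The raw value 3.0 equals the training mean, so it should standardize to 0, but it becomes -1 when the testing set's own statistics are used.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
test = pd.Series([3.0, 4.0, 5.0])

def zscore(values, mean, std):
    """Standardize values with the supplied mean and standard deviation."""
    return (values - mean) / std

# Current notebook behavior: test set scaled by its own statistics.
test_own = zscore(test, test.mean(), test.std())

# Expected behavior: test set scaled by the training statistics.
test_train = zscore(test, train.mean(), train.std())

print(test_own.tolist())    # first entry is -1.0
print(test_train.tolist())  # first entry is 0.0
```

A model fitted on the standardized training data would therefore see the same observation at a different location depending on which transformation was applied.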

Expected behavior

The training data mean and standard deviation should be used for standardizing the testing data like so:

dftest=(subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
Xte = (subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
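
As a rough sketch of how this could be factored out so that both methods share it, assuming `subdf` is a pandas DataFrame and `itrain`/`itest` are positional index arrays as in the lines above. The helper name `standardize_split` and the random data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def standardize_split(subdf, itrain, itest):
    """Standardize train and test rows using training statistics only."""
    train_mean = subdf.iloc[itrain].mean()
    train_std = subdf.iloc[itrain].std()  # pandas default: sample std (ddof=1)
    dftrain = (subdf.iloc[itrain] - train_mean) / train_std
    dftest = (subdf.iloc[itest] - train_mean) / train_std
    return dftrain, dftest

# Hypothetical data standing in for the lab's dataframe.
rng = np.random.default_rng(0)
subdf = pd.DataFrame(rng.normal(loc=10.0, scale=3.0, size=(100, 2)),
                     columns=["a", "b"])
itrain = np.arange(0, 70)
itest = np.arange(70, 100)

dftrain, dftest = standardize_split(subdf, itrain, itest)
```

With this, the training columns have mean 0 and standard deviation 1 by construction, while the testing columns generally do not, which is expected. scikit-learn's `StandardScaler` follows the same pattern when it is fitted on the training rows and then applied to the testing rows with `transform`, though it uses the population standard deviation (ddof=0) rather than the pandas default.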

I think this was mentioned in one of the earlier lectures and here are some more references:
