Solved – How to handle data normalization in kNN when new test data is received

k-nearest-neighbour, machine-learning, normalization

I had a discussion with my colleagues about the following problem:

Let's say we have 100 points of labeled data and we are using the $k$-nearest neighbor method for prediction. So our data looks like this:

$$X= \left\{(\textbf{x}_1, y_1),(\textbf{x}_2,y_2), \,…,(\textbf{x}_{100}, y_{100})\right\},$$

where $\textbf{x}_i\in \mathbb{R}^n,\;y_i\in\mathbb{R}$.

Now we use Leave-One-Out (or 10-fold, etc.) cross-validation to find the best $k$-value. We also normalize our data using z-score normalization. We do this at every iteration to prevent data snooping, i.e. at every iteration of Leave-One-Out we do the following (a code sketch of the full procedure follows the list):

  1. Drop the test point $(\textbf{x}_t, y_t)$ from the training data
  2. Now we find the mean $\mu$ and standard deviation $\sigma$ of the training set $X_0=X\setminus \{(\textbf{x}_t, y_t)\}$ and use them to normalize the training set $X_0$

  3. Now we normalize the test point $(\textbf{x}_t,y_t)$ with the mean $\mu$ and standard deviation $\sigma$ we got from $X_0$ and make a prediction for the test point.

  4. We repeat this process for the whole data set and obtain some error value $E$.
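For concreteness, here is a minimal sketch of this procedure, assuming the labeled data are stored in numpy arrays `X` (shape $100 \times n$) and `y`, and using scikit-learn's `KNeighborsRegressor`; the squared-error loss and the candidate range for $k$ are my own assumptions, not something fixed by the problem.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def loo_error(X, y, k):
    """Leave-one-out error of k-NN with per-fold z-score normalization."""
    n = len(X)
    errors = np.empty(n)
    for t in range(n):
        # 1. drop the test point from the training data
        mask = np.arange(n) != t
        X_train, y_train = X[mask], y[mask]
        # 2. mu and sigma are computed on the training fold only
        mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
        X_train_z = (X_train - mu) / sigma
        # 3. normalize the held-out point with the training mu and sigma
        x_test_z = ((X[t] - mu) / sigma).reshape(1, -1)
        knn = KNeighborsRegressor(n_neighbors=k).fit(X_train_z, y_train)
        errors[t] = (knn.predict(x_test_z)[0] - y[t]) ** 2
    # 4. the error value E for this k
    return errors.mean()

# pick the k with the smallest LOO error (the candidate range is an assumption)
# best_k = min(range(1, 21), key=lambda k: loo_error(X, y, k))
```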

Let's say now that we found our optimal $k$-value to be $k=3$. So our final model is the $3$-nearest-neighbor model, and we prevented data snooping by not including the test point $(\textbf{x}_t,y_t)$ in the calculation of the $\mu$ and $\sigma$ needed for the normalization.

Now we receive 50 new unlabeled data points: $Z=\{\textbf{z}_1,\dots,\textbf{z}_{50}\}$. We need to make predictions for these points. Now we finally get to the question:

How should we handle normalization of the test data now? Should we
find the mean $\mu$ and standard deviation $\sigma$ of the set $X$ to
normalize the new data points $Z$, OR should we use both sets
$X$ and $Z$ to find the $\mu$ and $\sigma$ used for
normalization?

So my question is: how should $\mu$ and $\sigma$ be calculated when we apply our model to the new set $Z$? We want to normalize the set $Z$, and for the normalization we need a mean $\mu$ and standard deviation $\sigma$, but which data should we use for calculating them? What would be the correct way to do this?

P.S. Note that my $\mu$ and $\sigma$ are vectors =) so each of the $n$ features has its own mean and standard deviation.

Hope my question is clear =) Thank you for any help! Please ask if anything is unclear.

Best Answer

Your validation process and the reasoning behind it are entirely correct.

Using the same reasoning / model-building process: after you have selected $k$ by validation, you build the final model using all the training data $X$ and calculate the mean and standard deviation based only on $X$, since these values are also part of the model.
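In code this would look roughly as follows, assuming `X`, `y` and `Z` are numpy arrays and $k=3$ as in the question (again using scikit-learn's `KNeighborsRegressor` as a stand-in for whatever k-NN implementation you use):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# mu and sigma are computed from the full training set X only;
# they are stored as part of the final model
mu, sigma = X.mean(axis=0), X.std(axis=0)
model = KNeighborsRegressor(n_neighbors=3).fit((X - mu) / sigma, y)

# new unlabeled points Z are normalized with the training mu and sigma,
# never with statistics computed from Z itself
predictions = model.predict((Z - mu) / sigma)
```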

Additionally: a classification model should classify new unlabeled instances independently of each other. But if you calculate the mean and standard deviation based on $Z$ as well, then the prediction for the same unlabeled instance might change depending on what the rest of $Z$ looks like. This is not correct.
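To see this concretely, here is a small synthetic illustration (the data, the point $\textbf{z}$ and the batch shift are all made up purely for demonstration): the same point $\textbf{z}$ can get different predictions depending on which batch $Z$ it arrives with, if the normalization statistics are (incorrectly) computed from $X \cup Z$:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 2)), rng.normal(size=100)   # labeled data
z = np.array([0.5, -0.2])                                 # one fixed new point

for shift in (0.0, 20.0):          # two batches Z that differ in feature 0
    Z = rng.normal(size=(50, 2))
    Z[:, 0] += shift
    Z[0] = z                       # the same point z arrives in both batches
    combined = np.vstack([X, Z])
    mu_bad, sigma_bad = combined.mean(axis=0), combined.std(axis=0)  # leaks Z
    knn = KNeighborsRegressor(n_neighbors=3).fit((X - mu_bad) / sigma_bad, y)
    # the prediction for the very same z can differ between the two batches,
    # because feature 0 is scaled very differently by the two sets of statistics
    print(knn.predict(((z - mu_bad) / sigma_bad).reshape(1, -1)))
```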

I guess the confusion originates from $k$-nearest-neighbor being a lazy learner, i.e. storing all the instances instead of deriving a model with reduced complexity. With other learners this is not done, so calculating the normalization parameters across the whole combined set would not even be possible. See this related question, which is not tied to a specific learner: Perform feature normalization before or within model validation?