It is quite hard to compare kNN and linear regression directly, as they are very different things. However, I think the key point here is the difference between "modelling $f(x)$" and "having assumptions about $f(x)$".
When doing linear regression, one specifically models $f(x)$, often as something along the lines of $f(x) = \mathbf{w}^\top\mathbf{x} + \epsilon$, where $\epsilon$ is a Gaussian noise term. One can work out that the maximum-likelihood model is then equivalent to the minimum sum-of-squares-error model.
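To make that least-squares/maximum-likelihood connection concrete, here is a minimal sketch in Python (numpy only; the synthetic data and the 1D model $f(x)=wx+b$ are my own illustration, not part of the question):

```python
import numpy as np

# Minimal sketch (assumed setup): fit f(x) = w*x + b by minimising the sum
# of squared errors, which is the maximum-likelihood fit when the noise
# term epsilon is Gaussian.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=50)  # true line plus Gaussian noise

# Design matrix with a column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(X, y, rcond=None)[0]
print(w, b)  # close to the true slope 2.5 and intercept 1.0
```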
kNN, on the other hand, as your second point suggests, assumes that the function can be approximated locally by a constant, using only some distance measure between the $x$'s, without explicitly modelling the whole distribution.
In other words, linear regression can often make a good guess at $f(x)$ for an unseen $x$ from the value of $x$ alone, whereas kNN needs other information (namely the $k$ neighbours) to predict $f(x)$, because the value of $x$ by itself tells it nothing: there is no model for $f(x)$.
EDIT: restating this below to express it more clearly (see comments).
Both linear regression and nearest-neighbour methods aim at predicting the value of $y=f(x)$ for a new $x$, but they take two different approaches. Linear regression proceeds by assuming that the data fall on a straight line (plus or minus some noise), so the value of $y$ is the value of $x$ times the slope of the line (plus an intercept). In other words, linear regression models the data as a straight line.
Nearest-neighbour methods, by contrast, do not care what the data look like (they do not model the data): they do not care whether it is a line, a parabola, a circle, etc. All they assume is that $f(x_1)$ and $f(x_2)$ will be similar if $x_1$ and $x_2$ are similar. Note that this assumption is roughly true for pretty much any model, including all the ones mentioned above. However, a NN method cannot tell how the value of $f(x)$ is related to $x$ (whether it is a line, a parabola, etc.), because it has no model of that relationship; it simply assumes that $f(x)$ can be approximated by looking at nearby points.
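As a rough illustration of this difference, here is a small sketch using scikit-learn (the data and parameter choices are mine, purely for illustration): on data generated from a parabola, linear regression still insists on a straight line, while kNN follows the curve simply by averaging nearby points, without ever modelling the parabola.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Illustrative sketch: data from a parabola plus noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=200))
y = x**2 + rng.normal(scale=0.3, size=200)
X = x.reshape(-1, 1)

lin = LinearRegression().fit(X, y)          # assumes a straight line
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)  # just averages the 5 nearest points

x_new = np.array([[2.0]])
print("true f(2) ~", 4.0)
print("linear regression:", lin.predict(x_new)[0])  # badly off, the line cannot bend
print("5-NN average:     ", knn.predict(x_new)[0])  # close to 4, with no model of the parabola
```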
A learning curve is a plot of the training and cross-validation (test, in your case) error as a function of the number of training points, not of the share of data points used for training. So it shows how the train/test errors evolve as the training set grows. See here for examples and more detail.
The 'train error' is the error (according to your loss function) achieved on the training set, and the 'test error' is the same quantity for the test set. See here for more detail.
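A hand-rolled version of such a learning curve might look like the sketch below (under assumptions of mine: synthetic data, a default SVC, and a fixed held-out test set; nothing here comes from your setup), with the error printed for increasing training-set sizes:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Sketch of a learning curve: error as a function of the NUMBER of training
# points, evaluated on a fixed held-out test set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n in [20, 50, 100, 200, 500]:
    model = SVC().fit(X_tr[:n], y_tr[:n])
    train_err = 1 - model.score(X_tr[:n], y_tr[:n])  # error on the points it was trained on
    test_err = 1 - model.score(X_te, y_te)           # error on the fixed test set
    print(f"n={n:4d}  train error={train_err:.3f}  test error={test_err:.3f}")
```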
If I interpret your chart correctly, the fraction of data you are using to test your model increases up to 90%; the error decreases for the 'test' data while it increases for the (simultaneously shrinking) training set.
In other words, as you train your SVM model on less and less data, the 'train error' increases, which makes sense. It is a bit odd that the test error would decrease as you shrink the training set, so perhaps I am misinterpreting your chart?
I will share a picture with you to clear up the ambiguity.
Assume you have training data in 2D space, labelled either red or green. In the left figure, you have a test data point (in grey). According to k-NN (the equation that you wrote), $$\hat{y}(x) = \frac{1}{k}\sum_{x_i\in N_k(x)}y_i$$ the $y_i$'s are the labels of the training points whose $x_i$ lie in the neighbourhood $N_k(x)$ of the test point $x$. So, after we compute this equation (see the right figure), we can judge where this point belongs (either red or green in our case).
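In code, the quoted formula is just an average over the labels of the $k$ nearest training points; a minimal numpy sketch (the toy points and the helper name `knn_predict` are mine, for illustration only) could be:

```python
import numpy as np

# Minimal sketch of the quoted formula: y_hat(x) is the average label of the
# k training points nearest to x (here in 2D, with Euclidean distance).
def knn_predict(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance from x to every training point
    neighbours = np.argsort(dists)[:k]           # indices of the k nearest points, N_k(x)
    return y_train[neighbours].mean()            # average of their labels

# Toy data: label 1 = "red", 0 = "green"; the grey test point sits near the reds.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([1, 1, 1, 0, 0])
print(knn_predict(np.array([1.1, 1.0]), X_train, y_train, k=3))  # 1.0 -> judged red
```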