Finding the optimal value of k in the k-nearest-neighbor classifier: is this cross-validation?

cross-validation, k-nearest-neighbour

I have collected 1000 data points, each belonging to one of eight categories. I would like to correctly predict the category of any new data point using the k-nearest-neighbor classifier. This requires me to find the best value of the hyperparameter k. What I would like to do is try various values of k, maybe from 1 to 40, then take every data point that I have (because why not use them all?) and see whether it is correctly classified for the given value of k. For each k I will then get an overall error proportion (the fraction of my data points that were misclassified), and I will use the k that gave the lowest error. Is this a reasonable way to do this? What is this method called; is it cross-validation where the size of each validation set is equal to one? Is that the same as leave-one-out cross-validation?

Best Answer

That's almost right, but there's one extra step needed: fit the model and choose $k$ using different sets of data (for knn, "fitting" just means remembering all the data points). If you fit the model and select the hyperparameters on the same set, the error will underestimate the true generalization error, which is the error you'd see on new data drawn from the same distribution. This means the classifier may do worse on new data than you'd expect. Cross validation is a way to choose $k$ that tries to minimize the generalization error.
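
To see why this matters for knn in particular: if each point is allowed to count itself as a neighbor, $k = 1$ classifies every training point perfectly, so picking $k$ by training-set error would favour $k = 1$ no matter how badly it generalizes. Here's a minimal sketch of that effect, assuming scikit-learn and NumPy are available; the synthetic data is just a stand-in for your 1000 points and eight categories.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))      # synthetic stand-in for your 1000 data points
y = rng.integers(0, 8, size=1000)   # eight categories, assigned at random (pure noise)

for k in (1, 5, 40):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Scoring on the same data used to fit: k = 1 scores 1.0 even though
    # the labels are noise and true accuracy can't beat 1/8.
    print(k, knn.score(X, y))
```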

Here's how to select $k$ using $d$-fold cross validation. It's usually called $k$-fold, but we're using $k$ to talk about the number of neighbors.

  1. Split the data into $d$ disjoint, similarly-sized subsets
  2. Hold out the first set. This is called the validation set.
  3. Train your classifier on the remaining data. In the case of knn classification, just remember all the data.
  4. For each value of $k$:
    • Classify each point in the validation set, using its $k$ nearest neighbors in the training set
    • Record the error
  5. Repeat steps 2-4 for each of the $d$ choices of the validation set.
  6. For each choice of $k$, find the average error across validation sets. Choose the value of $k$ with the lowest error.
  7. Construct a final classifier using all of the original data and the chosen value of $k$. This is what you'd use to classify new points.
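
Here's a minimal sketch of the steps above, assuming scikit-learn is available; the helper name `choose_k`, the fold count, and the range of candidate $k$ values are illustrative choices, not part of the recipe itself.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, k_values=range(1, 41), d=10, seed=0):
    folds = KFold(n_splits=d, shuffle=True, random_state=seed)    # step 1: d disjoint subsets
    errors = {k: [] for k in k_values}
    for train_idx, val_idx in folds.split(X):                     # steps 2-5: each subset is held out once
        X_train, y_train = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]
        for k in k_values:                                        # step 4: try every candidate k
            knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)  # step 3: "train" = remember the data
            errors[k].append(1 - knn.score(X_val, y_val))         # record this fold's error rate
    avg_error = {k: np.mean(errs) for k, errs in errors.items()}
    best_k = min(avg_error, key=avg_error.get)                    # step 6: lowest average validation error
    final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)  # step 7: refit on all the data
    return best_k, final_model
```

With your data you would call something like `best_k, model = choose_k(X, y)`, where `X` holds the 1000 points and `y` their category labels.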

In the case where $d$ is equal to the number of data points (i.e. each validation set contains a single point), this is called leave-one-out cross validation.
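
In scikit-learn terms (again only a sketch), that's the same as swapping `KFold(n_splits=d)` for `LeaveOneOut`, or setting `n_splits` to the number of data points:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(-1, 1)   # ten toy points
splits = list(LeaveOneOut().split(X))
print(len(splits))                 # 10 splits: one per data point
print(splits[0][1])                # each validation set is a single index, e.g. [0]
```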

If you want to estimate the generalization error of the final classifier, some extra work is needed. You can't just take the error on the validation sets, because $k$ was chosen to minimize it, so it's an underestimate of the true generalization error. Another way of thinking about this is that the procedure for choosing $k$ has to be considered part of the learning algorithm. The simplest thing to do is to test the final classifier on a separate set of data you held out at the beginning (called the test set). This is a reasonable option if you have lots of data and a slow algorithm.

Another option that takes longer but uses the data more efficiently is to use cross validation again. Here, there will be an 'outer' cross validation loop, where the held-out data is called the test set. At each step, you further split the remaining (non-test) data into training and validation sets and run a nested, 'inner' cross validation loop to choose $k$, as above. The final generalization error is estimated by classifying the test sets and averaging their errors.
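
As a sketch of the nested version, here is one way to write it with scikit-learn, using `GridSearchCV` as the inner loop that chooses $k$ and `cross_val_score` as the outer loop that estimates generalization error; the fold counts, the grid of $k$ values, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))              # stand-in for your 1000 points
y = rng.integers(0, 8, size=1000)           # stand-in for the eight category labels

# Inner loop: for each outer training set, choose k by 5-fold cross validation.
inner = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 41))},
    cv=5,
)

# Outer loop: each of 5 held-out folds acts as a test set; the inner search is
# re-run from scratch on the rest, so k is never chosen using test data.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("estimated generalization accuracy:", outer_scores.mean())
```

Note that the $k$ picked by the inner search may differ across outer folds; the outer average estimates the error of the whole "choose $k$, then fit" procedure, which is exactly why the choice of $k$ has to be treated as part of the learning algorithm.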
