KNN and K-Folding in R – Effective Techniques for Classification

Tags: caret, classification, cross-validation, k nearest neighbour, r

I'd like to use KNN to build a classifier in R.

I'd like to try several values of K, using 5-fold CV each time. How would I report the accuracy for each value of K?

I'm using the knn() function in R. I've also been using caret so that I can use trainControl(), but I'm confused about how to do this.
I know I haven't included the data, but I'm looking more for the approach.

Best Answer

To use 5-fold cross validation in caret, you can set the "train control" as follows:

library(caret)  # provides trainControl() and train()
trControl <- trainControl(method  = "cv",
                          number  = 5)

Then you can estimate the accuracy of the KNN classifier for each value of k by cross-validation using:

fit <- train(Species ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = trControl,
             metric     = "Accuracy",
             data       = iris)

Output:

k-Nearest Neighbors 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 120, 120, 120, 120, 120 
Resampling results across tuning parameters:

  k   Accuracy   Kappa
   1  0.9600000  0.94 
   2  0.9600000  0.94 
   3  0.9600000  0.94 
   4  0.9533333  0.93 
   5  0.9733333  0.96 
   6  0.9666667  0.95 
   7  0.9600000  0.94 
   8  0.9666667  0.95 
   9  0.9733333  0.96 
  10  0.9600000  0.94 

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was k = 9.
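
To report the accuracy for each k programmatically rather than reading it off the printed summary, the fitted object can be queried directly. The following is a minimal sketch, assuming the same iris example as above; the seed value and the name acc_by_k are my own choices, not part of the original answer. Because the folds are drawn at random, the exact numbers may differ from the output shown above.

```r
library(caret)

set.seed(42)  # assumed seed, used only so the fold assignment is reproducible

fit <- train(Species ~ .,
             method    = "knn",
             tuneGrid  = expand.grid(k = 1:10),
             trControl = trainControl(method = "cv", number = 5),
             metric    = "Accuracy",
             data      = iris)

# One row per value of k: mean accuracy and its standard deviation over the 5 folds
acc_by_k <- fit$results[, c("k", "Accuracy", "AccuracySD")]
print(acc_by_k)

# The value of k that train() selected
print(fit$bestTune)

# Per-fold accuracy; by default this is kept only for the selected k
print(fit$resample)
```

Calling plot(fit) on the same object draws cross-validated accuracy against k, which can be a convenient way to present the comparison.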

Useful ref: http://topepo.github.io/caret/index.html
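
Since the question mentions calling knn() directly, the same idea can also be written out by hand. This is only a sketch of the approach, not what caret does internally: it assumes knn() from the class package, uses a simple random (unstratified) fold assignment, and the names fold_id, cv_accuracy and manual_results are hypothetical. The accuracies will therefore not match the caret output exactly.

```r
library(class)  # provides knn(); a recommended package bundled with most R installations

set.seed(42)  # assumed seed, used only so the fold assignment is reproducible

x <- iris[, 1:4]   # predictors
y <- iris$Species  # class labels

n_folds  <- 5
k_values <- 1:10

# Randomly assign each row to one of 5 equally sized folds (not stratified by class)
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

# For each k, average the held-out accuracy over the 5 folds
cv_accuracy <- sapply(k_values, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn(train = x[!held_out, ],
                test  = x[held_out, ],
                cl    = y[!held_out],
                k     = k)
    mean(pred == y[held_out])  # accuracy on the held-out fold
  })
  mean(fold_acc)
})

manual_results <- data.frame(k = k_values, Accuracy = cv_accuracy)
print(manual_results)
```

Writing the loop out makes it explicit that every value of k is scored on the same five held-out folds, which is what makes the accuracies comparable across k.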