KNN and K-Folding in R – Effective Techniques for Classification

caret, classification, cross-validation, k nearest neighbour, r

I'd like to use KNN to build a classifier in R.

I'd like to try various values of K, using 5-fold CV each time. How would I report the accuracy for each value of K?

I'm using the knn() function in R. I've also been using caret so that I can use trainControl(), but I'm confused about how to do this.
I know I haven't included the data, but I'm looking more for the approach.

Best Answer

To use 5-fold cross validation in caret, you can set the "train control" as follows:

library(caret)

trControl <- trainControl(method  = "cv",
                          number  = 5)

Then you can evaluate the accuracy of the KNN classifier for different values of k by cross-validation using:

fit <- train(Species ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = trControl,
             metric     = "Accuracy",
             data       = iris)

Output:

k-Nearest Neighbors 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 120, 120, 120, 120, 120 
Resampling results across tuning parameters:

  k   Accuracy   Kappa
   1  0.9600000  0.94 
   2  0.9600000  0.94 
   3  0.9600000  0.94 
   4  0.9533333  0.93 
   5  0.9733333  0.96 
   6  0.9666667  0.95 
   7  0.9600000  0.94 
   8  0.9666667  0.95 
   9  0.9733333  0.96 
  10  0.9600000  0.94 

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was k = 9.
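
If you want the per-k accuracies as data rather than printed text, the fitted object keeps them. A minimal sketch, assuming the same iris setup as above; the seed is my own addition (the folds are random, so your numbers will vary with it):

```r
library(caret)

set.seed(42)  # assumed seed; not part of the original answer

fit <- train(Species ~ .,
             data      = iris,
             method    = "knn",
             tuneGrid  = expand.grid(k = 1:10),
             trControl = trainControl(method = "cv", number = 5),
             metric    = "Accuracy")

# One row per candidate k: mean accuracy over the 5 folds and its SD
acc_by_k <- fit$results[, c("k", "Accuracy", "AccuracySD")]
print(acc_by_k)

# The k that train() selected
print(fit$bestTune)

# Per-fold accuracy; by default this is kept only for the selected k
print(fit$resample)
```

plot(fit) draws the same accuracy-against-k curve if you would rather report it as a figure.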

Useful ref: http://topepo.github.io/caret/index.html
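
Since the question mentions knn() directly, here is roughly what the same idea looks like without caret. This is only a sketch under my own assumptions: the fold assignment is plain random (not stratified by class), the seed and variable names are made up for illustration, and the accuracies will not match the caret output above.

```r
library(class)  # provides knn()

set.seed(1)  # assumed seed, only so the fold split is repeatable

x <- iris[, 1:4]
y <- iris$Species
n_folds  <- 5
k_values <- 1:10

# Randomly assign each row to one of the 5 folds (30 rows per fold here)
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

# For each k, average the held-out accuracy over the 5 folds
cv_accuracy <- sapply(k_values, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    test_rows <- fold_id == f
    pred <- knn(train = x[!test_rows, ],
                test  = x[test_rows, ],
                cl    = y[!test_rows],
                k     = k)
    mean(pred == y[test_rows])
  })
  mean(fold_acc)
})

results <- data.frame(k = k_values, Accuracy = cv_accuracy)
print(results)
```

Using the same fold_id for every k means each value of k is scored on identical splits, which makes the comparison between them fairer.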

Related Solutions

Solved – Feature selection + classification in Caret

You should be able to accomplish everything you want with the sbf function instead. I originally assumed it worked the way you describe, but the functionality provided by sbf is apparently more like a superset of what's available in train.

For example, something like this sounds like what you're getting at:

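fit <- sbf(
  form = response ~ .,
  data = d,
  method = "glmnet",
  # a single fixed parameter combination, so nothing is tuned
  tuneGrid = expand.grid(alpha = 0.01, lambda = 0.1),
  preProc = c("center", "scale"),
  # no inner resampling
  trControl = trainControl(method = "none"),
  # 10 outer folds, with feature filtering done inside each fold
  sbfControl = sbfControl(functions = caretSBF, method = "cv", number = 10)
)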

This would run 10 outer folds and fit a single glmnet model to each, using only a feature subset. You could also specify some number of cv folds for trControl and a parameter grid to do training on inner folds.

Solved – Optimizing probability thresholds in a glm model in caret

You can vary the probability cutoff values over the range 0 to 1, and check the optimum cut off for maximum accuracy:

logmodel <- glm(y ~ ., data = data, family = binomial)

Treating logmodel as your fitted model, which outputs the probabilities, use a loop that calculates the classification accuracy for each cutoff value, as below:

cutoffs <- seq(0.1, 0.9, 0.1)
accuracy <- NULL
for (i in seq_along(cutoffs)) {
  # Classify as 1 when the fitted probability reaches the cutoff
  prediction <- ifelse(logmodel$fitted.values >= cutoffs[i], 1, 0)
  # Percentage of observations classified correctly at this cutoff
  accuracy <- c(accuracy, mean(data$y == prediction) * 100)
}

And then you can visually explore accuracy against the cutoff by plotting:

plot(cutoffs, accuracy, pch =19,type='b',col= "steelblue",
     main ="Logistic Regression", xlab="Cutoff Level", ylab = "Accuracy %")

This is the type of output you get (I've added some ablines): [figure: accuracy % against cutoff level, not reproduced here]
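
To read the best cutoff off programmatically instead of from the plot, something like the following should work. It is a sketch only: I'm using the built-in mtcars data with am as a stand-in for y, which is my assumption and not the asker's data, and the accuracy is measured on the same rows the model was fitted to, so it is optimistic.

```r
# mtcars ships with R; am (0/1 transmission type) stands in for y
logmodel <- glm(am ~ wt, data = mtcars, family = binomial)

cutoffs <- seq(0.1, 0.9, 0.1)

# Accuracy (%) of the thresholded fitted probabilities at each cutoff
accuracy <- sapply(cutoffs, function(cutoff) {
  prediction <- ifelse(logmodel$fitted.values >= cutoff, 1, 0)
  mean(prediction == mtcars$am) * 100
})

best <- which.max(accuracy)  # first maximum if several cutoffs tie
cat("Best cutoff:", cutoffs[best], "with accuracy", accuracy[best], "%\n")
```

For an honest estimate you would compute the accuracy on held-out predictions (for example from cross-validation) rather than on logmodel$fitted.values.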
