Solved – R knn variable selection

caret, feature selection, k nearest neighbour, r

I have a data set with 200k rows and 50 columns. I'm trying to fit a knn model to it, but performance varies hugely depending on which variables are used: R-squared ranges from .01 (using all variables) to .98 (using only 5 variables).

This compounds my problem, since I now need to determine both k and which variables to use.

Is there a package in R that helps with selecting variables for a knn model while also tuning k? I've looked at rfe() in caret, but it seems to be built only for linear regression, random forest, naive Bayes, etc., and not for knn.

As an aside, I've tried manually building a loop to use the caret train function like this:

for (i in 2:50) {
  ## trains model using a single variable; drop = FALSE keeps x a data frame
  knnFit <- train(x[, i, drop = FALSE], y, ...)
}

My problem is that knnFit$results prints all of the results and knnFit$bestTune only prints the final parameter of k.

> data1 <- data.frame(col1=runif(20), col2=runif(20), col3=runif(20), col4=runif(20), col5=runif(20))
> bootControl <- trainControl(number = 1)
> knnGrid <- expand.grid(.k=c(2:5))
> set.seed(2)
> knnFit1 <- train(data1[,-c(1)], data1[,1]
+ , method = "knn", trControl = bootControl, verbose = FALSE,
+ tuneGrid = knnGrid )
> knnFit1 
20 samples
 4 predictors

No pre-processing
Resampling: Bootstrap (1 reps) 

Summary of sample sizes: 20 

Resampling results across tuning parameters:

  k  RMSE   Rsquared
  2  0.485  0.124   
  3  0.54   0.369   
  4  0.52   0.241   
  5  0.528  0.232   

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was k = 2. 

> knnFit1$results
  k      RMSE  Rsquared RMSESD RsquaredSD
1 2 0.4845428 0.1241031     NA         NA
2 3 0.5401009 0.3690569     NA         NA
3 4 0.5197262 0.2410814     NA         NA
4 5 0.5277939 0.2317607     NA         NA

> knnFit1$bestTune
  .k
1  2

I need some way to print the RMSE, R-squared, or other metric for only the single best-performing model (i.e., just "R-Squared: .91").

Any suggestions?

Best Answer

knnFit1$results is actually a data.frame, so you can print all of the R-squared results with:

knnFit1$results$Rsquared

Or the R-squared result for just the best model:

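## sort the results so the row with the highest R-squared comes first
knnFit1.sorted <- knnFit1$results[order(knnFit1$results$Rsquared, decreasing = TRUE), ]
knnFit1.sorted[1, 'Rsquared']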
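Note that in the output above caret selected k by smallest RMSE, so the model it reports as final is not necessarily the row with the highest R-squared. A minimal sketch for pulling the metrics of the model caret actually selected is to match the results table against bestTune. The helper name best_row is hypothetical, and the fit object below is a hand-built stand-in for a train() result that reuses the numbers printed in the question, so the snippet runs without caret. It assumes bestTune has one column per tuning parameter, possibly with a leading dot as in the older output shown above.

```r
# Hypothetical helper: return the row of fit$results that matches fit$bestTune.
best_row <- function(fit) {
  tune <- fit$bestTune
  # Older caret versions prefix tuning parameter names with a dot (".k");
  # strip it so the names line up with the columns of fit$results.
  names(tune) <- sub("^\\.", "", names(tune))
  merge(fit$results, tune, by = names(tune))
}

# Stand-in for a train() object, built from the values printed in the question.
fit <- list(
  results = data.frame(
    k        = c(2, 3, 4, 5),
    RMSE     = c(0.4845428, 0.5401009, 0.5197262, 0.5277939),
    Rsquared = c(0.1241031, 0.3690569, 0.2410814, 0.2317607)
  ),
  bestTune = data.frame(.k = 2)
)

best <- best_row(fit)
cat("R-Squared:", round(best$Rsquared, 3), "\n")
```

With a real fit, best_row(knnFit1) should work the same way. Inside a loop over candidate variables, storing best_row(knnFit) from each iteration (for example in a list that is later combined with rbind) gives one summary row per variable instead of overwriting the fit each time.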

Does this answer your question?
