I have a data set with 200k rows and 50 columns. I'm trying to fit a knn model to it, but performance varies hugely depending on which variables are used: R-squared ranges from .01 (using all variables) to .98 (using only 5 variables).
This compounds my problem, because I now need to determine both k and which variables to use.
Is there a package in R that helps with selecting variables for a knn model while tuning k? I've looked at rfe() in caret, but it seems to be built only for linear regression, random forest, naive Bayes, etc., and not for knn.
As an aside, I've tried manually building a loop to use the caret train function like this:
for (i in 2:50) {
  knnFit <- train(x[,i],y,...) ## trains model using single variable
}
My problem is that knnFit$results prints all of the results, while knnFit$bestTune prints only the final value of k.
> data1 <- data.frame(col1=runif(20), col2=runif(20), col3=runif(20), col4=runif(20), col5=runif(20))
> bootControl <- trainControl(number = 1)
> knnGrid <- expand.grid(.k=c(2:5))
> set.seed(2)
> knnFit1 <- train(data1[,-c(1)], data1[,1]
+ , method = "knn", trControl = bootControl, verbose = FALSE,
+ tuneGrid = knnGrid )
> knnFit1
20 samples
4 predictors
No pre-processing
Resampling: Bootstrap (1 reps)
Summary of sample sizes: 20
Resampling results across tuning parameters:
  k  RMSE   Rsquared
  2  0.485  0.124
  3  0.54   0.369
  4  0.52   0.241
  5  0.528  0.232
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 2.
> knnFit1$results
  k      RMSE  Rsquared RMSESD RsquaredSD
1 2 0.4845428 0.1241031     NA         NA
2 3 0.5401009 0.3690569     NA         NA
3 4 0.5197262 0.2410814     NA         NA
4 5 0.5277939 0.2317607     NA         NA
> knnFit1$bestTune
  .k
1  2
I need some way to print the RMSE, R-squared, or other metric for only the single best-performing model (e.g., just "R-Squared: .91").
Any suggestions?
Best Answer
knnFit1$results is actually a data.frame, so you can print all of the R-squared results with:
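A minimal sketch of that column access. So that it runs without caret, it assumes a hand-built data frame named results (a hypothetical stand-in for knnFit1$results) filled with the values printed in the question:

```r
# Stand-in for knnFit1$results; values copied from the printout in the question.
# (Assumption: with a real fit you would use knnFit1$results directly.)
results <- data.frame(
  k        = 2:5,
  RMSE     = c(0.4845428, 0.5401009, 0.5197262, 0.5277939),
  Rsquared = c(0.1241031, 0.3690569, 0.2410814, 0.2317607)
)

# One R-squared value per candidate k, in the same order as the rows
all_rsq <- results$Rsquared
print(all_rsq)
```

With an actual caret fit, the equivalent expression would be knnFit1$results$Rsquared.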
Or the R-squared result for just the best model:
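One way to do this, sketched under assumptions: results and best_tune below are hand-built stand-ins for knnFit1$results and knnFit1$bestTune, using the values shown in the question. The tuning column is called .k in that printout; other caret versions may call it k, so the sketch reads the first column of best_tune rather than relying on the name.

```r
# Stand-ins for knnFit1$results and knnFit1$bestTune (values from the question).
results <- data.frame(
  k        = 2:5,
  RMSE     = c(0.4845428, 0.5401009, 0.5197262, 0.5277939),
  Rsquared = c(0.1241031, 0.3690569, 0.2410814, 0.2317607)
)
best_tune <- data.frame(.k = 2)

# Take the selected k from the first column, whatever its name is
best_k <- best_tune[[1]]

# Keep only the row of the results table that matches the selected k
best_row <- results[results$k == best_k, ]

# Print a single metric for the best model
cat(sprintf("R-Squared: %.2f\n", best_row$Rsquared))
```

Inside a loop over variables, the same lookup could be run after each fit and the value stored in a vector, so that only one number per variable is kept.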
Does this answer your question?