Given I have classified my inputs using knn() from R's class package:
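library(class)  # knn() comes from the class package
data <- read.csv(...)
idx <- sample(nrow(data), floor(0.8 * nrow(data)))  # assumed here: a random 80/20 split
features <- setdiff(names(data), "Class")
data.training <- data[idx, features]      # 80% of data, excluding Class column
data.trainLabels <- data[idx, "Class"]    # the Class column excluded from data.training
data.test <- data[-idx, features]         # other 20% of data, excluding Class column
data.testLabels <- data[-idx, "Class"]    # the Class column excluded from data.test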
...
data_pred <- knn(train=data.training, test=data.test, cl=data.trainLabels, k=3)
I can see the accuracy of my predictions by comparing data_pred with data.testLabels, and it's not bad: 85-90% accuracy.
I want to save the model used in knn() so it can be loaded later to predict new data. For instance, I have 2 sets of classified data: one I have now, and one my professor has. I break my data into data.training and data.test so that I can perform n-fold CV on it. How can I get the kNN model produced by my set so it can blindly predict the results of my professor's data set?
I saw how to use save(), load(), and predict() in this answer for lm(). But according to this answer, knn might not have a model?
I can't connect the dots. Is it not possible to get a model from knn? My professor asked to "provide specification for the model selected". Do I need to use another algorithm to classify this data?
Best Answer
The kNN algorithm does not do any explicit training, so there is actually no model to save. Let's recall what knn does: given a parameter $k$ and a set of training pairs $(\mathbf{x}_i,y_i)$, $i=1,\dots,n$, with feature vectors $\mathbf{x}_i\in\mathbb{R}^d$ and class labels $y_i$, to classify any new feature vector $\mathbf{x}\in\mathbb{R}^d$ we find the $k$ feature vectors $\mathbf{x}_i$ in the training set that are closest to $\mathbf{x}$ (in, say, the Euclidean distance) and assign $\mathbf{x}$ the most common class among the labels $y_i$ that correspond to those nearest $\mathbf{x}_i$.

Hence, to classify any new $\mathbf{x}\in\mathbb{R}^d$ (like those in your test set or those that your professor has), all you need is: 1) the parameter $k$ (you fixed it to 3, but more generally it can be tuned to optimize classification accuracy), 2) any other parameters such as the distance function, 3) the training set. Thus, in your case you would need to run again

data_pred <- knn(train=data.training, test=data.test, cl=data.trainLabels, k=3)

where everything is defined the same as before except that now data.test is the second dataset that your professor has.
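Since the "model" amounts to the training set, its labels, and $k$, one way to get something you can save and load is to store exactly those three things. The following is only a sketch under stated assumptions: iris stands in for your data, and the names knn_spec, spec_file, and predict_knn are hypothetical helpers introduced for illustration, not part of the class package.

```r
library(class)  # provides knn()

# Illustrative data only: iris stands in for the asker's data set.
set.seed(1)
idx <- sample(nrow(iris), size = floor(0.8 * nrow(iris)))
train_x <- iris[idx, 1:4]        # features of the training rows
train_y <- iris$Species[idx]     # class labels of the training rows
new_x <- iris[-idx, 1:4]         # stands in for the professor's data

# The "model" is just the training set, its labels, and k.
knn_spec <- list(train = train_x, cl = train_y, k = 3)
spec_file <- tempfile(fileext = ".rds")  # hypothetical location
saveRDS(knn_spec, spec_file)

# Later, possibly in another session: reload and classify new data.
# predict_knn is a hypothetical wrapper around class::knn().
predict_knn <- function(spec, newdata) {
  knn(train = spec$train, test = newdata, cl = spec$cl, k = spec$k)
}
loaded <- readRDS(spec_file)
new_pred <- predict_knn(loaded, new_x)
```

The saved list also doubles as a specification of the model: it records $k$ and the training data, and knn() from the class package uses the Euclidean distance. If you scaled or normalized the training features, the same transformation has to be applied to the new data before calling the wrapper.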