Solved – How to get model in knn()

Tags: k-nearest-neighbour, machine-learning, predictive-models, r

Given that I have classified my inputs using R's knn() (from the class package):

library(class)                       # provides knn()

data <- read.csv(...)                # full data set, including a Class column
train.idx <- sample(nrow(data), 0.8 * nrow(data))   # random 80/20 split

data.training    <- data[train.idx, names(data) != "Class"]
data.trainLabels <- data[train.idx, "Class"]

data.test        <- data[-train.idx, names(data) != "Class"]
data.testLabels  <- data[-train.idx, "Class"]

data_pred <- knn(train = data.training, test = data.test, cl = data.trainLabels, k = 3)

I can see the accuracy of my predictions by comparing data_pred with data.testLabels, and it's not bad: 85-90% accuracy.
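(For reference, the comparison I use is just a one-liner over the objects above:)

```r
# fraction of test points classified correctly
mean(data_pred == data.testLabels)
```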

I want to save the model used in knn() so it can be loaded later to predict new data. For instance, I have 2 sets of classified data: one I have now, and one my professor has. I break my data into data.training and data.test so that I can perform n-fold CV on it. How can I get the kNN model produced by my set so it can blindly predict the results of my professor's data set?

I saw how to use save(), load(), and predict() in this answer for lm(). But according to this answer, knn might not produce a model at all?

I can't connect the dots. Is it not possible to get a model from knn? My professor asked to "provide specification for the model selected". Do I need to use another algorithm to classify this data?

Best Answer

The kNN algorithm does no explicit training, so there is actually no model to be saved. Recall what knn does: given a parameter $k$ and a set of training pairs $(\mathbf{x}_i, y_i)$ with $\mathbf{x}_i\in\mathbb{R}^d$, $i=1,\dots,n$, to classify a new feature vector $\mathbf{x}\in\mathbb{R}^d$ we find the $k$ feature vectors $\mathbf{x}_i$ from the training set that are closest to $\mathbf{x}$ (in, say, the Euclidean distance) and assign $\mathbf{x}$ the most common class among the labels $y_i$ of those nearest neighbours.
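To make the "no model" point concrete, here is a minimal from-scratch sketch of that classification rule in R (the function name and arguments are my own, purely illustrative):

```r
# Hypothetical sketch of the kNN rule described above.
# train:  numeric matrix of training feature vectors (one per row)
# labels: their classes; x: one new feature vector; k: number of neighbours
knn_classify <- function(train, labels, x, k = 3) {
  dists   <- sqrt(rowSums(sweep(train, 2, x)^2))  # Euclidean distance to x
  nearest <- labels[order(dists)[1:k]]            # classes of the k nearest
  names(which.max(table(nearest)))                # majority vote
}
```

Notice that the only "state" the function consumes is the training data and $k$; nothing is ever fitted or stored in between calls.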

Hence, to classify any new $\mathbf{x}\in\mathbb{R}^d$ (just like those in your test set, or those your professor has), all you need is: 1) the parameter $k$ (you fixed it at 3, but more generally it could be tuned to optimize classification accuracy), 2) any other settings, such as the distance function, and 3) the training set itself. Thus, in your case you would simply run knn again:

data_pred <- knn(train = data.training, test = data.test, cl = data.trainLabels, k = 3)

where everything is defined as before, except that data.test is now the second data set that your professor has.
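If you nevertheless want something you can save() and load() later, the object to persist is simply the training data plus your chosen $k$. A hedged sketch (the file name and the toy stand-in objects are made up for illustration):

```r
library(class)

# Toy stand-ins for the question's objects (illustrative only)
data.training    <- matrix(runif(40), ncol = 2)
data.trainLabels <- factor(rep(c("a", "b"), each = 10))

# The "model" is just the training data and k; persist those.
saveRDS(list(train = data.training, cl = data.trainLabels, k = 3),
        file = "knn_model.rds")

# Later (or on another machine): reload and classify new points
m <- readRDS("knn_model.rds")
new.points <- matrix(c(0.1, 0.1, 0.9, 0.9), ncol = 2, byrow = TRUE)
new_pred <- knn(train = m$train, test = new.points, cl = m$cl, k = m$k)
```

This also answers the "provide specification for the model selected" request: the specification is the algorithm (kNN), the value of $k$, the distance used, and the training set — there is nothing else to report.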