Solved – Can you derive variable importance from a nearest-neighbor algorithm?

k nearest neighbour

While the helpful tooltip warned me this question was subjective, I don't think it is. It should be possible to state objectively, from a theoretical perspective, whether or not you can establish the importance of any given feature in a KNN setting.

I do not think you can. If the data are properly scaled, a point's neighbors are simply the points that are close overall, without regard to any particular variable; in my understanding, all variables are equally useful for determining which points are neighbors in high-dimensional space. If all k of a new point's nearest neighbors are in one class, it doesn't matter which of the variables made them close, just that they were close enough in the p-dimensional space to be neighbors.

Is my understanding correct? Or could you somehow recover which variables had the largest absolute effect on distance (this might just be the standard deviation of the variables…), or on which points were labeled neighbors?

I ask because my non-data-scientist boss asked me to say which features were most important to my KNN results, and I want to be certain I am correct in saying it is not reasonably possible (other than, say, running a regression on the KNN results and using those second-hand importance values).
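For concreteness, the kind of per-feature breakdown hinted at above is at least mechanically recoverable: squared Euclidean distance is a sum of per-feature terms, so for any query point you can compute each variable's share of the distance to its neighbors. A minimal NumPy sketch (the data and k = 5 here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 standardized points, 3 features
query = rng.normal(size=3)      # a new point to classify

# Squared Euclidean distance decomposes into one term per feature.
diffs_sq = (X - query) ** 2               # shape (100, 3)
dists = diffs_sq.sum(axis=1)              # total squared distances
nearest = np.argsort(dists)[:5]           # indices of the 5 nearest points

# Each feature's average share of the squared distance to the k neighbors.
contrib = diffs_sq[nearest] / diffs_sq[nearest].sum(axis=1, keepdims=True)
mean_contrib = contrib.mean(axis=0)       # one share per feature, sums to 1
print(mean_contrib)
```

This tells you which coordinates dominated the distances for a given query, but as the question argues, that is not the same thing as a feature being important for the classification itself.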

Best Answer

You are right: KNN does not provide coefficients or importance values for the variables. I think the alternative you suggested, fitting another model such as a regression (or a random forest) to the KNN predictions and reading off its coefficients or importances, is the more logical approach. A similar question has already been asked on Stack Exchange.
