Solved – Determine the feature weights with a regression

classification, k-nearest-neighbour, machine-learning, python, regression

I have a set of houses with their features (location, size, number of rooms, etc.), and the target y is the price. In the future I will have a new house without a price, and my goal is to find the 20 houses closest to it.

I am working on a k-NN algorithm, and I don't take the prices of the other houses into account, since when I use the algorithm to retrieve the 20 most similar houses for a given house, I won't have the price of that house. But the features don't all have the same impact, so I want to set feature weights.

I was thinking about using a linear regression to determine the weight of each feature: the features would be x and the price y. I would then keep the regression coefficients (the fitted parameters) as the weights for my k-NN algorithm.

I haven't found this kind of method for determining feature weights anywhere. Is there a reason for that, or could it be a good approximation?
Do you recommend any other method for determining weights for a k-NN algorithm like this?

Any input will be much appreciated!

Best Answer

I think you are on the right track (feature weights make sense).

I am assuming that the goal here is to find the 20 neighbors (based on the features) with the closest price.

One thing to point out right away is that you can test the model with cross-validation: take a hold-out test sample and perform the nearest-neighbor search over the remaining training sample. That way you can empirically determine whether a given set of feature weights leads to finding good neighbors. This should be your approach to determining the best weights.
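For concreteness, here is a minimal sketch of that evaluation, assuming `X` holds the numeric house features, `y` the prices, and `weights` a candidate vector of feature weights (names of my own choosing). The score is the mean absolute gap between each held-out house's price and the average price of its 20 weighted nearest neighbours, so lower is better:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def score_weights(X, y, weights, k=20, random_state=0):
    """Mean absolute gap between a held-out price and the mean price
    of its k nearest feature-weighted training neighbours."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)

    # Standardize on the training sample, then apply the candidate weights.
    scaler = StandardScaler().fit(X_train)
    X_train_w = scaler.transform(X_train) * weights
    X_test_w = scaler.transform(X_test) * weights

    nn = NearestNeighbors(n_neighbors=k).fit(X_train_w)
    _, idx = nn.kneighbors(X_test_w)  # idx has shape (n_test, k)

    neighbour_prices = np.asarray(y_train)[idx].mean(axis=1)
    return np.mean(np.abs(neighbour_prices - np.asarray(y_test)))
```

Comparing `score_weights(X, y, your_weights)` against an unweighted baseline such as `score_weights(X, y, np.ones(X.shape[1]))` then tells you whether a weighting scheme actually helps.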

You can then try your idea of using linear regression and see how it performs versus not setting any feature weights. Assuming the relationship is truly linear, I think this should perform well (you should standardize the features first, though). I would also try just using simple correlations of the features with the DV (the price).
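As a rough sketch of that regression idea (assuming the same `X` and `y` as above): standardize the features so the coefficient magnitudes are comparable, fit the regression, and keep the absolute values of the coefficients, since the sign only says whether a feature pushes the price up or down, not how relevant it is:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Standardize so the coefficients are on a comparable scale.
X_std = StandardScaler().fit_transform(X)

reg = LinearRegression().fit(X_std, y)
weights = np.abs(reg.coef_)        # coefficient magnitude as relevance
weights = weights / weights.sum()  # optional: make the weights sum to 1
```

These `weights` can then be fed straight into the hold-out evaluation above.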

A more sophisticated approach is to use something like boosted trees or a random forest instead (these are popular for determining feature importance).
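A minimal sketch of that alternative, again assuming the same `X` and `y`: a random forest regressor exposes `feature_importances_`, which are non-negative and sum to one, so they can be plugged in directly as the k-NN weights:

```python
from sklearn.ensemble import RandomForestRegressor

# Importances are non-negative and already sum to 1.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
weights = rf.feature_importances_
```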