I'm quite new to this StackExchange, only been a lurker till now, but my StackOverflow fellows have said you'd be the best people to ask about this.

Anyway, enough introduction. I'm using the weighted k-Nearest-Neighbours algorithm. My original data set has 37 features. I've looked into using PCA to reduce dimensionality, and I'm going to follow this method.

For simplicity's sake, let us assume that two of the new features created account for 90% of the variance and that I'm only going to use these two new features. Let us call them feature 1 and feature 2 ($f_1$, $f_2$). Let us say that $f_1$ accounts for 60% of the variance and $f_2$ accounts for 30% of the variance. I now wish to select the weights ($w_1, w_2$) for these two features. My initial intuition is that we could tie each feature's weight to the variance it accounts for. Therefore, I would use the weight combination $w_1 = 0.6$ and $w_2 = 0.3$ in my k-Nearest-Neighbours algorithm.
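To make the intuition concrete, here is a rough sketch of what I have in mind (NumPy only, on toy data I made up; the variance ratios won't be exactly 60%/30%, this just shows the mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 5 features with very different spreads
X = rng.normal(size=(100, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# PCA via SVD on the centered data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component

# project onto the top two components and use their explained-variance
# ratios as the kNN feature weights (my intuition from above)
Z = Xc @ Vt[:2].T
w = explained[:2]

def weighted_dist(a, b, w):
    # weighted Euclidean distance to be used inside kNN
    return np.sqrt(np.sum(w * (a - b) ** 2))
```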

I am well aware that there is much literature suggesting that the best way to select weights is a lattice (grid-search) type of method, where we try different combinations of weights and keep the combination that yields the best results. I was just wondering whether the intuition of tying the weights to the total variance accounted for is valid. Also, as my dataset actually requires 11 features to account for 90% of the variance, I'd like a sensible starting point for the combinations of weights to try.
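For reference, this is roughly what I understand the lattice approach to look like (my own NumPy sketch on toy data, scoring each weight combination by leave-one-out accuracy of a small kNN):

```python
import numpy as np
from itertools import product

def knn_predict(Xtr, ytr, x, w, k=3):
    # weighted-Euclidean kNN: majority vote among the k nearest neighbours
    d = np.sqrt(((Xtr - x) ** 2 * w).sum(axis=1))
    idx = np.argsort(d)[:k]
    vals, counts = np.unique(ytr[idx], return_counts=True)
    return vals[np.argmax(counts)]

def loo_accuracy(X, y, w, k=3):
    # leave-one-out accuracy for a given weight vector
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        hits += knn_predict(X[mask], y[mask], X[i], w, k) == y[i]
    return hits / len(X)

# toy two-class data in two (PCA-reduced) features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# lattice of candidate weights; keep the combination with the best LOO score
grid = [np.array([w1, w2]) for w1, w2 in product([0.2, 0.4, 0.6, 0.8], repeat=2)]
best_w = max(grid, key=lambda w: loo_accuracy(X, y, w))
```

With 11 features this grid grows exponentially, which is exactly why I'd like a principled starting point such as the variance ratios.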

**Summary:** When using PCA as a precursor to kNN, is it reasonable to base the weights of the features in k-NN on the total variance those features account for in the data?

Sorry if there are any formatting errors or if I'm breaking any protocols. Let me know if I have, and I will update the post.

## Best Answer

There are many options you could pursue; I can suggest a few. First, if you already have a training set, and assuming it is large enough, you could learn a distance metric instead of deriving the weights from PCA. See the Mahalanobis distance as an example of distance-metric learning.

The main idea is that you intend to use a weighted Euclidean metric: $$ D(x_1,x_2)=\sqrt{(x_1-x_2)^T C (x_1-x_2)} $$ $$ C=\operatorname{diag}(w_1,\dots,w_n) $$ The Mahalanobis distance is defined similarly, except that it also takes the covariance between the variables into account (some of your features may be correlated): $$ D_M(x_1,x_2)=\sqrt{(x_1-x_2)^T S^{-1}(x_1-x_2)} $$

where $S$ is the covariance matrix.
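A minimal NumPy sketch of the two metrics (toy correlated data of my own; note the weighted Euclidean metric is just the special case where the inverse covariance is replaced by a diagonal weight matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
# toy data with correlated features (mixing matrix introduces correlation)
X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])

S = np.cov(X, rowvar=False)     # covariance matrix estimated from the data
S_inv = np.linalg.inv(S)        # Mahalanobis uses the *inverse* covariance

def mahalanobis(x1, x2, S_inv):
    # D_M(x1, x2) = sqrt((x1 - x2)^T S^{-1} (x1 - x2))
    diff = x1 - x2
    return np.sqrt(diff @ S_inv @ diff)
```

Passing a diagonal matrix of weights instead of `S_inv` recovers the weighted Euclidean metric above.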

Another option is, instead of using PCA, which is an unsupervised method, to use a supervised one, such as Class-Augmented PCA. Generally speaking, you could use any interpretable machine-learning classifier (one that gives you feature weights) and then run k-NN with those weights.
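As one concrete instance of that last idea (a sketch of my own, not a prescribed recipe): fit a simple logistic regression with plain gradient descent and reuse the normalized absolute coefficients as kNN feature weights.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, steps=500):
    # plain gradient-descent logistic regression (no intercept, for brevity);
    # the magnitude of each coefficient indicates that feature's importance
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(float)   # feature 0 matters most

coef = fit_logreg(X, y)
knn_weights = np.abs(coef) / np.abs(coef).sum()   # normalized feature weights
```

Any classifier whose coefficients are interpretable as per-feature importances could be substituted here.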