Solved – Finding weights for variables in kNN

k-nearest-neighbour, machine-learning, optimization, r

I'm using Euclidean distance for kNN. I have labelled data; I took the logarithm of some variables to make them closer to normally distributed, and then scaled them all. Now I would like to multiply some variables by weights, compute the Euclidean distance, and train kNN. But how do I find those weights? My idea is to determine the centres of the classes (call this set C) and then optimise the kNN weights on C by random search. I don't think I can do it on a subset of the training set, because its size would be either too large, or too small to represent the dataset accurately.
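To make the idea concrete, here is a minimal sketch of a kNN classifier with a per-feature weight vector `w` (the name and the random-search setup are illustrative, not from any particular library):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, w, k=3):
    """Predict the label of point x by majority vote among the k
    training points nearest under a weighted Euclidean distance."""
    # Multiplying each feature by its weight before taking distances
    # is equivalent to using sum(w_j^2 * (x_j - x'_j)^2) as the metric.
    d = np.sqrt((((X_train - x) * w) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

With `w` set to all ones this reduces to ordinary Euclidean kNN; the question is how to choose a better `w`.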

Do you have any other ideas?
I also don't think that tuning the parameters k and l amounts to the same approach as mine, or maybe it does?

Best Answer

Hastie and Tibshirani's paper on Discriminant Adaptive Nearest Neighbor Classification would be a good place to start.

A simple approach would be to choose the weights to minimise the leave-one-out error rate. However, one of the advantages of kNN is that, being a relatively simple method, it is usually quite easy to avoid over-fitting (you basically just need to choose k). That advantage is easily lost if you try to tune the distance metric as well, so doing so may make the model's performance worse rather than better.
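A sketch of the leave-one-out approach, using random search over the weights as the questioner suggests (the function names and the uniform sampling scheme are illustrative assumptions, not a prescribed method):

```python
import numpy as np

rng = np.random.default_rng(0)

def loo_error(X, y, w, k=3):
    """Leave-one-out error rate of weighted-Euclidean kNN: predict each
    point from all the *other* points and count the mistakes."""
    Xw = X * w  # apply the feature weights once
    errors = 0
    for i in range(len(X)):
        d = np.sqrt(((Xw - Xw[i]) ** 2).sum(axis=1))
        d[i] = np.inf  # exclude the held-out point itself
        nearest = np.argsort(d)[:k]
        labels, counts = np.unique(y[nearest], return_counts=True)
        errors += labels[np.argmax(counts)] != y[i]
    return errors / len(X)

def random_search_weights(X, y, n_trials=200, k=3):
    """Random search over non-negative weight vectors, keeping the one
    with the lowest leave-one-out error (unit weights as baseline)."""
    best_w = np.ones(X.shape[1])
    best_err = loo_error(X, y, best_w, k)
    for _ in range(n_trials):
        w = rng.uniform(0.0, 2.0, size=X.shape[1])
        err = loo_error(X, y, w, k)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```

Since the unweighted metric is included as the starting baseline, the search can never do worse on the leave-one-out criterion itself; the over-fitting risk discussed above shows up on unseen data, not on this score.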