Distance Functions – When to Use Weighted Euclidean Distance and How to Determine the Weights

distance-functions

I have a set of data where each data point consists of $n$ different measures. For each measure, I have a benchmark value. I would like to know how close each data point is to the benchmark.

I thought of using the Weighted Euclidean Distance like this:

$\hspace{0.5in} d_{x,b}=\left( \sum_{i=1}^{n}w_i(x_i-b_i)^2\right)^{1/2} $

where

$\hspace{0.5in}x_i$ is the value of the $i$-th measure for a particular data point,

$\hspace{0.5in}b_i$ is the corresponding benchmark value for that measure.

$\hspace{0.5in} w_i$ is the weight that I will attach to the $i$-th measure, subject to the following:

$\hspace{1in}0<w_i<1$ and $\sum_{i=1}^{n}w_i=1$
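For concreteness, here is a minimal sketch of the distance I have in mind (Python/NumPy; the vectors `x`, `b`, and `w` are just placeholder values):

```python
import numpy as np

def weighted_euclidean(x, b, w):
    """Weighted Euclidean distance between a data point x and the benchmark b."""
    x, b, w = (np.asarray(v, dtype=float) for v in (x, b, w))
    # Constraints from above: each weight lies in (0, 1) and the weights sum to 1.
    assert np.all((w > 0) & (w < 1)) and np.isclose(w.sum(), 1.0)
    return np.sqrt(np.sum(w * (x - b) ** 2))

# Example: three measures with weights 0.5, 0.3, 0.2.
print(weighted_euclidean([2.0, 5.0, 1.0], [1.0, 4.0, 3.0], [0.5, 0.3, 0.2]))
```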

However, based on this document, I found that the weight to use is the reciprocal of the $i$-th measure's variance. I don't think this sort of weighting will account for the importance that I want to attach to each measure.

Therefore:

  1. Are there methods to come up with a set of weights that reflect the relative importance the observer attaches to each measure, or can the observer assign arbitrary values to the weights?

  2. Is it appropriate to use the Weighted Euclidean Distance to solve this problem?

Best Answer

Weights for standardisation

The setup you have is a variant of Mahalanobis distance. When $w_i$ is the reciprocal of each measurement's variance, you are effectively putting all the measurements on the same scale. This implies you think the variation in each is equally 'important', but that some are measured in units that are not immediately comparable.
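A quick way to see the equivalence (a sketch with made-up data; `X` and `b` are hypothetical): weighting by the reciprocal of each measurement's variance gives exactly the Euclidean distance computed on standardised columns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 0.1])  # three measures on very different scales
b = np.array([0.5, -2.0, 0.05])                             # hypothetical benchmark vector

# Weighted Euclidean distance with reciprocal-of-variance weights...
w = 1.0 / X.var(axis=0)
d_weighted = np.sqrt(((X - b) ** 2 * w).sum(axis=1))

# ...equals the plain Euclidean distance after standardising each column.
mu, sd = X.mean(axis=0), X.std(axis=0)
d_standardised = np.sqrt((((X - mu) / sd - (b - mu) / sd) ** 2).sum(axis=1))

assert np.allclose(d_weighted, d_standardised)
```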

Weights for importance

You are free to put anything you like as weights, including measures of 'importance' (although you may want to standardise before importance weighting if the measurement units differ).
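One way to combine the two, sketched below with hypothetical importance scores: rescale each difference by that measure's spread first, then apply importance weights normalised to sum to 1.

```python
import numpy as np

def importance_weighted_distance(x, b, importance, scale):
    """Standardise each difference by `scale` (e.g. a standard deviation),
    then apply importance weights normalised to sum to 1."""
    x, b, importance, scale = (np.asarray(v, dtype=float) for v in (x, b, importance, scale))
    w = importance / importance.sum()   # enforce sum(w) = 1
    z = (x - b) / scale                 # put the measures on a common scale
    return np.sqrt(np.sum(w * z ** 2))

# Hypothetical: the second measure is judged twice as important as the others.
print(importance_weighted_distance([2.0, 5.0, 1.0], [1.0, 4.0, 3.0],
                                    importance=[1.0, 2.0, 1.0], scale=[0.5, 1.5, 2.0]))
```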

An example may help clarify the issues: consider the idea of estimating ideological 'distances' between political actors. In this application $x_i$ and $b_i$ might be the positions of two actors on the $i$-th issue, and $w_i$ the salience of that issue. For example, $b_i$ might be the status quo position on some dimension, from which various actors' positions differ. In this application one would certainly prefer to measure rather than assert both salience and position. Either way, small weights on non-salient issues will make differences on those issues have less effect on the overall distance between actors if distances are computed according to your first equation. Notice also that in this version we implicitly assume no relevant covariance among positions, which is a fairly strong claim.

Focusing now on question 2: in the application I just described, the justification for the weighting and the distances grounds out in game-theoretic assumptions about transitive preference structures and suchlike. Ultimately, these are the only reasons it is 'appropriate' to compute distances this way. Without them, we've just got a bunch of numbers that obey the triangle inequality.

Weights as implicit measurement

On the covariance theme, it might be helpful to think of your problem as one of identifying the relevant subspace within which distances make substantive sense, on the assumption that many of the measurements you have actually measure similar things. A measurement model, e.g. factor analysis, would project everything via weighted combination into a common space wherein distances could be computed. But, again, we'd have to know the context of your research to say whether that would make sense.
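A sketch of that idea, using scikit-learn's FactorAnalysis on made-up data (the data `X`, benchmark `b`, and number of factors are all assumptions, not part of your problem):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                  # two underlying traits
loadings = rng.normal(size=(2, 6))                  # six observed measures built from them
X = latent @ loadings + 0.3 * rng.normal(size=(200, 6))
b = np.zeros(6)                                     # hypothetical benchmark on the observed measures

# Project both the data and the benchmark into the common factor space,
# then compute distances there instead of in the raw measurement space.
fa = FactorAnalysis(n_components=2).fit(X)
scores = fa.transform(X)
b_score = fa.transform(b.reshape(1, -1))
d = np.linalg.norm(scores - b_score, axis=1)
```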