A common solution to this common problem (i.e., over-weighting variables) is to standardize your data.
To do this, you perform two successive column-wise operations on your data: subtract the column mean from every value, then divide by the column standard deviation, i.e., $x' = \frac{x - \bar{x}}{\sigma_x}$.
The rationale of these two operations is to ensure the values have zero mean (from subtracting the mean in the numerator) and unit variance (from dividing by the standard deviation in the denominator).
For instance, in NumPy:
>>> import numpy as NP
>>> # first create a small data matrix comprised of three variables
>>> # having three different 'scales' (means and variances)
>>> a = 10*NP.random.rand(6)
>>> b = 50*NP.random.rand(6)
>>> c = 2*NP.random.rand(6)
>>> A = NP.column_stack((a, b, c))
>>> A # the pre-standardized data
array([[ 1.753, 37.809,  1.181],
       [ 1.386,  8.333,  0.235],
       [ 2.827, 40.5  ,  0.625],
       [ 5.516, 47.202,  0.183],
       [ 0.599, 27.017,  1.054],
       [ 8.918, 35.398,  1.602]])
>>> # mean center the data (columnwise)
>>> A -= NP.mean(A, axis=0)
>>> A
array([[ -1.747,   5.099,  0.368],
       [ -2.114, -24.377, -0.578],
       [ -0.673,   7.79 , -0.189],
       [  2.016,  14.493, -0.631],
       [ -2.901,  -5.693,  0.24 ],
       [  5.418,   2.688,  0.789]])
>>> # divide by the standard deviation
>>> A /= NP.std(A, axis=0)
>>> A
array([[-0.606,  0.409,  0.716],
       [-0.734, -1.957, -1.125],
       [-0.233,  0.626, -0.367],
       [ 0.7  ,  1.164, -1.228],
       [-1.007, -0.457,  0.468],
       [ 1.881,  0.216,  1.536]])
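As a quick sanity check, each column of the standardized data should now have (numerically) zero mean and unit standard deviation:

>>> NP.allclose(NP.mean(A, axis=0), 0)
True
>>> NP.allclose(NP.std(A, axis=0), 1)
True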
It's more common to measure discrepancy than similarity between distributions, but many discrepancy measures can easily be converted into the kind of similarity you're after.
Possible measures of discrepancy in distribution include (but are not limited to):
Kolmogorov-Smirnov distance. This distance between CDFs (or empirical CDFs), $D$, is small when the distributions are the same and close to 1 when they're very different, so $1-D$ has the property you seek; it also doesn't require the same number of observations in the two samples (indeed, many of these measures don't).
Bhattacharyya distance. The Bhattacharyya coefficient, $BC = \sum_i \sqrt{p_i q_i}$, to which it is related via $D_B = -\ln BC$ (see the article), is itself a measure of similarity between distributions of the form you suggest.
Information divergence (the Kullback-Leibler divergence). This is not symmetric ($D(x,y) \neq D(y,x)$) and is not a metric, but it can be symmetrized (e.g., by using $D(x,y)+D(y,x)$), and there are metric distances related to this divergence.
Chi-square distance: A variety of related measures go by this name, used for discrete data (or discretized continuous data). I'll mention one: $d(x,y) = \frac{1}{2}\sum_i \frac{(x_i-y_i)^2}{x_i+y_i}$. As with the other chi-square distances, this requires discretizing both variables into the same set of categories, where the $x_i$ and $y_i$ are the proportions of each variable's total count falling in category $i$. This distance lies between 0 and 1 and is converted to a similarity by subtracting it from 1. (A small sketch converting some of these discrepancies into similarities follows this list.)
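To make the conversion concrete, here is a minimal sketch, assuming SciPy is available, that turns the Kolmogorov-Smirnov and chi-square distances into similarities; the sample sizes, bin count, and smoothing constant eps are arbitrary illustration choices:

>>> import numpy as NP
>>> from scipy import stats
>>> x = NP.random.rand(100)
>>> y = NP.random.rand(120)                # sample sizes need not match
>>> D = stats.ks_2samp(x, y).statistic     # Kolmogorov-Smirnov distance
>>> ks_sim = 1 - D                         # similarity in [0, 1]
>>> # chi-square distance: discretize both samples into shared bins,
>>> # convert counts to proportions, then apply the formula above
>>> bins = NP.histogram_bin_edges(NP.concatenate((x, y)), bins=10)
>>> p = NP.histogram(x, bins=bins)[0] / len(x)
>>> q = NP.histogram(y, bins=bins)[0] / len(y)
>>> eps = 1e-12                            # guards against empty bins
>>> chi2_sim = 1 - 0.5 * NP.sum((p - q)**2 / (p + q + eps))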
Best Answer
The literature on hierarchical clustering deals with similarity measures between groups. The most popular measures of group similarity are perhaps single linkage, complete linkage, and average linkage.
Single linkage defines the distance between two groups as the distance between their two nearest members. Complete linkage uses the two most distant members. Average linkage uses the mean of all pairwise distances between members of the two groups.
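As a minimal sketch of how these three rules are selected in practice, assuming SciPy is available (the data here are random placeholders):

>>> import numpy as NP
>>> from scipy.cluster.hierarchy import linkage
>>> X = NP.random.rand(10, 3)                   # 10 observations, 3 variables
>>> Z_single = linkage(X, method='single')      # two nearest members
>>> Z_complete = linkage(X, method='complete')  # two most distant members
>>> Z_average = linkage(X, method='average')    # mean of pairwise distances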