A common solution to this common problem (i.e., over-weighting variables) is to standardize your data.
To do this, you perform two successive column-wise operations on your data: subtract the column mean from every value, then divide by the column standard deviation, i.e., $x' = \frac{x - \bar{x}}{\sigma_x}$.
The rationale of these two operations is to ensure the values have zero mean (from subtracting the mean in the numerator) and unit variance (from dividing by the standard deviation in the denominator).
For instance, in NumPy:
>>> import numpy as NP
>>> # first create a small data matrix comprised of three variables
>>> # having three different 'scales' (means and variances)
>>> a = 10*NP.random.rand(6)
>>> b = 50*NP.random.rand(6)
>>> c = 2*NP.random.rand(6)
>>> A = NP.column_stack((a, b, c))
>>> A # the pre-standardized data
array([[ 1.753, 37.809,  1.181],
       [ 1.386,  8.333,  0.235],
       [ 2.827, 40.5  ,  0.625],
       [ 5.516, 47.202,  0.183],
       [ 0.599, 27.017,  1.054],
       [ 8.918, 35.398,  1.602]])
>>> # mean center the data (columnwise)
>>> A -= NP.mean(A, axis=0)
>>> A
array([[ -1.747,   5.099,  0.368],
       [ -2.114, -24.377, -0.578],
       [ -0.673,   7.79 , -0.189],
       [  2.016,  14.493, -0.631],
       [ -2.901,  -5.693,  0.24 ],
       [  5.418,   2.688,  0.789]])
>>> # divide by the standard deviation
>>> A /= NP.std(A, axis=0)
>>> A
array([[-0.606,  0.409,  0.716],
       [-0.734, -1.957, -1.125],
       [-0.233,  0.626, -0.367],
       [ 0.7  ,  1.164, -1.228],
       [-1.007, -0.457,  0.468],
       [ 1.881,  0.216,  1.536]])
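As a quick sanity check, each column of the standardized data should now have (numerically) zero mean and unit standard deviation:

>>> NP.allclose(NP.mean(A, axis=0), 0)
True
>>> NP.allclose(NP.std(A, axis=0), 1)
True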
It's more common to measure discrepancy than similarity between distributions, but many discrepancy measures can easily be converted into the kind of similarity you're after.
Possible measures of discrepancy in distribution include (but are not limited to):
Kolmogorov-Smirnov distance. This distance between CDFs (or empirical CDFs), $D$, is small when the distributions are the same and close to 1 when they're very different, so $1-D$ has the property you seek; it also doesn't require the same number of observations in the two samples (indeed, many of these measures don't).
Bhattacharyya distance. The Bhattacharyya coefficient, $BC = \sum_i \sqrt{p_i q_i}$, to which it is related via $D_B = -\ln BC$ (see the article), is itself a measure of similarity between distributions of the form you suggest.
Information divergence (the Kullback-Leibler divergence). This is not symmetric ($D(x,y) \neq D(y,x)$) and is not a metric, but it can be symmetrized (e.g., by using $D(x,y)+D(y,x)$), and there are metric distances related to this divergence.
Chi-square distance: A variety of related measures go by this name, used for discrete data (or discretized continuous data). I'll mention one: $d(x,y) = \frac{1}{2}\sum_i \frac{(x_i-y_i)^2}{x_i+y_i}$. As with the other chi-square distances, this requires discretizing both variables into the same set of categories, where the $x_i$ and $y_i$ are the proportions of each variable's total count falling in category $i$. This distance lies between 0 and 1 and is converted to a similarity by subtracting it from 1. (A small sketch converting some of these discrepancies into similarities follows this list.)
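To make the conversion concrete, here is a minimal sketch, assuming SciPy is available, that turns the Kolmogorov-Smirnov and chi-square distances into similarities; the sample sizes, bin count, and smoothing constant eps are arbitrary illustration choices:

>>> import numpy as NP
>>> from scipy import stats
>>> x = NP.random.rand(100)
>>> y = NP.random.rand(120)                # sample sizes need not match
>>> D = stats.ks_2samp(x, y).statistic     # Kolmogorov-Smirnov distance
>>> ks_sim = 1 - D                         # similarity in [0, 1]
>>> # chi-square distance: discretize both samples into shared bins,
>>> # convert counts to proportions, then apply the formula above
>>> bins = NP.histogram_bin_edges(NP.concatenate((x, y)), bins=10)
>>> p = NP.histogram(x, bins=bins)[0] / len(x)
>>> q = NP.histogram(y, bins=bins)[0] / len(y)
>>> eps = 1e-12                            # guards against empty bins
>>> chi2_sim = 1 - 0.5 * NP.sum((p - q)**2 / (p + q + eps))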
Best Answer
The literature on hierarchical clustering deals with similarity measures between groups. The most popular measures of group similarity are perhaps single linkage, complete linkage, and average linkage.
Single linkage defines the distance between two groups as the distance between their two nearest members. Complete linkage uses the two most distant members. Average linkage uses the mean of all pairwise distances between members of the two groups.
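As a minimal sketch of how these three rules are selected in practice, assuming SciPy is available (the data here are random placeholders):

>>> import numpy as NP
>>> from scipy.cluster.hierarchy import linkage
>>> X = NP.random.rand(10, 3)                   # 10 observations, 3 variables
>>> Z_single = linkage(X, method='single')      # two nearest members
>>> Z_complete = linkage(X, method='complete')  # two most distant members
>>> Z_average = linkage(X, method='average')    # mean of pairwise distances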