Solved – Multivariate distance function in Excel

excel, matrix

I want to do a simple nearest neighbour calculation in Excel over a multivariate space to get an idea of how my data clusters. I have a set of data points $\{X_1, X_2, \ldots, X_n\}$ and a set of possible cluster centers $\{M_1, M_2, \ldots, M_m\}$. Initially, I propose to simply use a weight vector $w$, and compute the scalar product of $w$ and $[X_i - M_j]$ to get the distance of point $i$ to cluster centre $j$. This is all very simple… but completely unwieldy if $n$ and/or $m$ and/or the dimensionality of my data points is large. So is there any (simple) way of hiding all the matrix calculations and producing a (matrix/vector) formula that produces the (scalar) distance without using temporary arrays?

Best Answer

Excel supports matrix operations.

In this case, do the following:

Put the data points in an $n$ by $p$ array where $p$ is the dimensionality of the space. Call this array X.
Put the cluster centers in an $m$ by $p$ array and call it M.
Put the weights into a $1$ by $p$ array and call it W.
Create a range for the $n$ by $m$ calculation. Bound it on the left with the sequence $1,2,\ldots, n$, going down the column. To be concrete, let's suppose this sequence is in cells A2, A3, etc. Bound it above with the sequence $1,2,\ldots, m$. To be concrete, let's suppose this is in cells B1, C1, etc. Thus the upper left corner of the results will be in cell B2.
Select the top cell in the result array (B2). In the formula bar type

    =MMULT(W, ABS(TRANSPOSE(OFFSET(X, $A2-1, 0, 1) - OFFSET(M, B$1-1, 0, 1))))

and press Enter. Drag this formula throughout the entire array, first to the right across all $m$ cells of the top row, and then, after selecting the entire top row, down to include all $n$ rows. Judicious use of "$" in the formula causes it to update appropriately when dragged. (This illustrates how to compute an outer product in Excel.)

This formula does the following:

OFFSET(X, $A2-1, 0, 1) uses the entries in the left column (column A) to index into the rows of array X.
OFFSET(M, B$1-1, 0, 1) uses the entries in the top row (row 1) to index into the rows of array M.
- subtracts the designated row of M from the designated row of X, yielding a $1$ by $p$ array.
TRANSPOSE converts that result to a $p$ by $1$ array.
ABS replaces each entry of that $p$ by $1$ array with its absolute value.
MMULT performs the matrix multiplication of the $1$ by $p$ array W by the $p$ by $1$ array computed in the preceding step, producing a $1$ by $1$ array: that is, a number (the distance).
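As a cross-check outside Excel, the same calculation can be sketched in plain Python. This is only an illustration of what the formula computes, namely $D_{ij} = \sum_k w_k\,|X_{ik} - M_{jk}|$; the function name `weighted_distances` and the small sample arrays are made up for the example and are not part of the spreadsheet.

```python
def weighted_distances(X, M, w):
    """Return the n-by-m table D with D[i][j] = sum_k w[k] * |X[i][k] - M[j][k]|."""
    return [
        [sum(wk * abs(xk - mk) for wk, xk, mk in zip(w, x, m)) for m in M]
        for x in X
    ]


# Hypothetical sample data: n = 3 points, m = 2 centers, p = 2 dimensions.
X = [[0.0, 0.0], [1.0, 2.0], [4.0, 4.0]]
M = [[0.0, 0.0], [4.0, 5.0]]
w = [1.0, 0.5]

D = weighted_distances(X, M, w)

# Index of the nearest center for each point (0-based, unlike the 1-based
# row and column labels used in the spreadsheet layout).
nearest = [row.index(min(row)) for row in D]
```

Each row of `D` corresponds to one row of the results range in the sheet, so comparing a few entries by hand is a quick way to confirm that the dragged formula is indexing X and M as intended.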

Related Solutions

Solved – Quartiles in Excel

Typically, a rank $r$ (between $1$ and $n$ for $n$ data) is converted to a percent $p$ via the formula

$$p = 100\frac{r-\alpha}{n+1-2\alpha}$$

for some predetermined "plotting position" $\alpha$ between $0$ and $1$, inclusive. Solving for $r$ in terms of $p$ gives

$$r = (n+1-2\alpha) (p/100) + \alpha.$$

Excel has historically used $\alpha=1$ for its PERCENTILE and QUARTILE functions. The online documentation for QUARTILE.INC and QUARTILE.EXC is useless, so we have to reverse-engineer what these functions are doing.

For example, with data $(1,2,3,4,5,6,7,8,9,10)$, we have $n=10$ and $p \in \{25, 50, 75\}$ for the three quartiles. Using $\alpha=1$ in the preceding formula yields ranks of $9(0.25)+1 = 3.25$, $9(0.50)+1 = 5.5$, and $9(0.75)+1 = 7.75$, reproducing the results for QUARTILE.INC.

If instead we set $\alpha=0$ the corresponding ranks are $11(0.25) = 2.75$, $11(0.50) = 5.5$, and $11(0.75) = 8.25$, reproducing the results for QUARTILE.EXC.

Further testing on your part (I do not have a recent version of Excel) may establish the validity of my guess that these two versions of the quartile function are determined by these two (extreme) values of $\alpha$.

By the way, fractional ranks are converted into data values by means of linear interpolation. The process is explained and illustrated in my course notes at Percentiles and EDF Plots; look near the bottom of that page. There is also a link to an Excel spreadsheet illustrating the calculations.
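The rank formula and the interpolation step can be checked with a short Python sketch. It reproduces the ranks quoted above for the data $(1, \ldots, 10)$ under $\alpha=1$ and $\alpha=0$. The function names are hypothetical, and clamping out-of-range ranks to $[1, n]$ is an assumption made here for simplicity, not a description of what Excel does.

```python
def rank_from_percent(p, n, alpha):
    """Fractional rank r = (n + 1 - 2*alpha) * (p / 100) + alpha, with p in percent."""
    return (n + 1 - 2 * alpha) * (p / 100) + alpha


def value_at_rank(sorted_data, r):
    """Linearly interpolate between the order statistics on either side of rank r."""
    n = len(sorted_data)
    r = min(max(r, 1), n)      # assumption: clamp ranks to the range [1, n]
    lower = int(r)             # floor of r, valid because r >= 1
    if lower == n:
        return sorted_data[-1]
    frac = r - lower
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])


data = list(range(1, 11))      # the example data 1, 2, ..., 10

inc_ranks = [rank_from_percent(p, len(data), 1) for p in (25, 50, 75)]
exc_ranks = [rank_from_percent(p, len(data), 0) for p in (25, 50, 75)]

inc_values = [value_at_rank(data, r) for r in inc_ranks]
exc_values = [value_at_rank(data, r) for r in exc_ranks]
```

Because the example data are the integers 1 through 10, each interpolated value equals its rank, which makes the two conventions easy to compare side by side.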

If you would like to implement a general percentile function in Excel, here's a VBA macro to do it:

    '
    ' Converts a percent, computed using plotting position constant A,
    ' into a percent appropriate for the Excel Percentile() and
    ' Quartile() functions.  (The default value of A for Excel is 1;
    ' most values in use are between 0 and 0.5.)
    '
    Public Function PercentileA(P As Double, N As Integer, A As Double) As Double
        If N < 1 Or A < 0# Or A > 1# Or P < 0# Or P > 1# Then
            Exit Function
        End If
        If N < 2 Then
            PercentileA = 0.5
        Else
            PercentileA = ((N - 2 * A + 1) * P + A - 1) / (N - 1)
        End If
    End Function

It converts a nominal percent (such as 25/100) into the percent that would cause Excel's PERCENTILE function to return the desired value. It is intended for use in cell formulas, as in =PERCENTILE(Data, PercentileA(0.25, Count(Data), 0.5)).
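To see why the conversion works, here is a rough Python port of the macro together with a stand-in for the $\alpha=1$ percentile convention. Both are sketches for checking the arithmetic: `percentile_inc` is a hypothetical helper that assumes the rank $(n-1)q + 1$ with linear interpolation, not Excel's actual implementation, and unlike the VBA version (which silently returns 0 on invalid input) the port raises an error.

```python
def percentile_a(p, n, a):
    """Port of PercentileA: map fraction p under plotting position a to the alpha = 1 scale."""
    if n < 1 or not (0.0 <= a <= 1.0) or not (0.0 <= p <= 1.0):
        raise ValueError("need n >= 1 and both a and p in [0, 1]")
    if n < 2:
        return 0.5
    return ((n - 2 * a + 1) * p + a - 1) / (n - 1)


def percentile_inc(sorted_data, q):
    """Hypothetical stand-in for the alpha = 1 convention: rank (n - 1) * q + 1."""
    n = len(sorted_data)
    q = min(max(q, 0.0), 1.0)  # assumption: clamp converted fractions to [0, 1]
    r = (n - 1) * q + 1
    lower = int(r)             # floor of r, valid because r >= 1
    if lower >= n:
        return sorted_data[-1]
    return sorted_data[lower - 1] + (r - lower) * (sorted_data[lower] - sorted_data[lower - 1])


data = list(range(1, 11))

# Lower quartile with plotting position 0.5: the direct rank is
# (10 + 1 - 1) * 0.25 + 0.5 = 3, so the third smallest value is expected.
q = percentile_a(0.25, len(data), 0.5)
lower_quartile = percentile_inc(data, q)
```

With $\alpha = 1$ the conversion leaves the fraction unchanged, which is consistent with the claim that Excel's historical functions correspond to that plotting position.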

Solved – Clustering of set of matrices

Since you have only 25 instances, I think hierarchical clustering is the way to go.

Not much more to say until you've tried it: it's straightforward to use with any similarity measure, and with 25 instances the poor $O(n^3)$ scalability is irrelevant.
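For a sense of how little is involved, here is a naive agglomerative sketch in plain Python. It is only one possible variant: single linkage and the Frobenius distance between matrices are assumptions chosen for the illustration, and the names `frobenius` and `agglomerate` as well as the toy matrices are hypothetical. Any other distance or similarity can be passed in instead.

```python
import math


def frobenius(A, B):
    """Frobenius distance between two equally sized matrices (lists of rows)."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))


def agglomerate(items, dist, k):
    """Naive single-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy input: two matrices near zero and two near ten.
matrices = [
    [[0, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[10, 10], [10, 10]],
    [[10, 11], [10, 10]],
]
clusters = agglomerate(matrices, frobenius, 2)
```

The result is a list of index lists, one per cluster. Recomputing pairwise distances on every pass is wasteful, but at 25 instances that cost is negligible.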
