Solved – Multivariate distance function in Excel

excelmatrix

I want to do a simple nearest neighbour calculation in Excel over a multivariate space to get an idea of how my data clusters. I have a set of data points $\{X_1, X_2, \ldots X_n\}$ and a set of possible cluster centers $\{M_1, M_2, \ldots, M_m\}$. Initially, I propose to simply use a weight vector $w$, and compute the scalar product of $w$ and $[X_i – M_j]$ to get the distance of point $i$ to cluster centre $j$. This is all very simple… but completely unwieldy if $n$ and/or $m$ and/or the dimensionality of my data points is large. So is there any (simple) way of hiding all the matrix calculations and producing a (matrix/vector) formula that produces the (scalar) distance without using temporary arrays?

Best Answer

Excel supports matrix operations.

In this case, do the following:

  • Put the data points in an $n$ by $p$ array where $p$ is the dimensionality of the space. Call this array X.

  • Put the cluster centers in an $m$ by $p$ array and call it M.

  • Put the weights into a $1$ by $p$ array and call it W.

  • Create a range for the $n$ by $m$ calculation. Bound it on the left with the sequence $1,2,\ldots, n$, going down the column. To be concrete, let's suppose this sequence is in cells A2, A3, etc. Bound it above with the sequence $1,2,\ldots, m$. To be concrete, let's suppose this is in cells B1, C1, etc. Thus the upper left corner of the results will in cell B2.

  • Select the top cell in the result array (B2). In the formula bar type

    =MMULT(W, ABS(TRANSPOSE(OFFSET(X, $A2-1, 0, 1) - OFFSET(M, B$1-1, 0, 1))))

and press Enter. Drag this formula throughout the entire array, first to the right across all $m$ cells of the top row, and then after selecting the entire top row, down to include all $n$ rows. Judicious use of "\$" in the formula causes it to update appropriately when dragged. (This illustrates how to compute an outer product in Excel.)

This formula does the following:

  1. OFFSET(X, $A2-1, 0, 1) uses the entries in the left column (column A) to index into the rows of array X.

  2. OFFSET(M, B$1-1, 0, 1) uses the entries in the top row (row 1) to index into the rows of array M.

  3. - subtracts the designated row of M from the designated row of X, yielding a $1$ by $p$ array.

  4. TRANSPOSE converts that result to a $p$ by $1$ array.

  5. MMULT performs the matrix multiplication of the $1$ by $p$ array W by the $p$ by $1$ array computed in the preceding step, producing a $1$ by $1$ array: that is, a number (the distance).