[Math] Distance or Similarity between matrices that are not the same size

linear-algebra, matrices

I have many matrices of different sizes. Specifically, the matrices have the same number of columns but vary in the number of rows.

In other words, I have matrices $A_1,\dots,A_N$ where $A_i\in \mathbb{R}^{n_i \times k}$, $k$ is a constant, $n_i \in [\min,\max]$ with $\min, \max \in \mathbb{N}^+$ and $\max>\min$.

Is there any method to calculate the distance or similarity among those matrices?

What are the advantages and disadvantages of those methods?

Or just give me a hint on where to find references to learn more.

I think I could take each row as a vector and calculate the cosine similarity between two vectors that come from two different matrices. It's kind of like a distance matrix.

But I discarded this approach because it splits the matrix apart, and I want each matrix to be treated as a single entity in the similarity calculation.
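For completeness, the row-wise idea can be sketched in a few lines of NumPy; the function name `rowwise_cosine` is mine, not from any library, and it assumes the rows of both matrices share the dimension $k$:

```python
import numpy as np

def rowwise_cosine(A1, A2, eps=1e-12):
    # Rows of A1 (n, k) and A2 (m, k) live in R^k; return the (n, m)
    # matrix of cosine similarities between every pair of rows.
    # eps guards against division by zero for all-zero rows.
    U = A1 / (np.linalg.norm(A1, axis=1, keepdims=True) + eps)
    V = A2 / (np.linalg.norm(A2, axis=1, keepdims=True) + eps)
    return U @ V.T
```

The result is a rectangular similarity matrix rather than a single scalar, which is exactly why I found it unsatisfying: it compares pieces of the matrices instead of the matrices as whole entities.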

Thank you all.

Best Answer

Thanks to everyone who kindly answered or commented on my question; it was helpful.

I found two ways to solve my problem.

1. The RV coefficient.

Take each column of a matrix as an independent realization of a random vector. So, to compare matrices $A_1 \in \mathbb{R}^{n \times k}$ and $A_2 \in \mathbb{R}^{m \times k}$, $m,n \in \mathbb{N}^+$, I turn the problem into calculating the dependence between two random vectors $\mathbf{a_1} \in \mathbb{R}^n$ and $\mathbf{a_2} \in \mathbb{R}^m$: the matrices $A_{1}$ and $A_{2}$ hold $k$ independent realizations of these random vectors and are assumed to be centered.

The RV coefficient is defined as follows: $$ RV(X,Y)=\frac{\operatorname{tr}(XX^{'}YY^{'})}{\sqrt{\operatorname{tr}\!\big((XX^{'})^2\big)\operatorname{tr}\!\big((YY^{'})^2\big)}}$$ Substituting $X= A_{1}^{'}$ and $Y= A_{2}^{'}$ gives the linear dependency between the two matrices.

However, this coefficient can only measure the linear dependency of two random vectors, so even if it equals zero, you can only say that the two vectors have no linear relationship with each other.
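The RV formula above translates directly into NumPy; this is a minimal sketch (the function name `rv_coefficient` is mine), assuming the $k$ columns of each matrix are the realizations:

```python
import numpy as np

def rv_coefficient(A1, A2):
    # A1: (n, k), A2: (m, k); columns are k independent realizations.
    # Center each coordinate across the k realizations, then treat the
    # realizations as rows, i.e. X = A1' and Y = A2' as in the text.
    X = (A1 - A1.mean(axis=1, keepdims=True)).T   # (k, n)
    Y = (A2 - A2.mean(axis=1, keepdims=True)).T   # (k, m)
    Sx = X @ X.T                                  # (k, k) Gram matrix
    Sy = Y @ Y.T
    num = np.trace(Sx @ Sy)
    den = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return num / den
```

Because the Gram matrices are positive semidefinite, the result lies in $[0,1]$, with $RV(A,A)=1$.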

2. The dCov coefficient.

This coefficient measures dependence between random vectors of arbitrary dimensions, so the two matrices may differ in their number of rows; only the number of realizations (columns) must match.

Definition of the empirical distance covariance: $$ dCov_n^{2}(X,Y)=\frac{1}{n^{2}} \sum_{i,j=1}^{n} (d_{ij}^X-d_{i.}^{X}-d_{.j}^{X}+d_{..}^{X})(d_{ij}^Y-d_{i.}^{Y}-d_{.j}^{Y}+d_{..}^{Y}) $$

where $d_{ij}^{X}$ is the Euclidean distance between realizations $i$ and $j$ of the random vector $X$ (and likewise for $Y$), $d_{i.}= \frac{1}{n}\sum_{j=1}^{n}d_{ij}$, $d_{.j}= \frac{1}{n}\sum_{i=1}^{n}d_{ij}$, and $d_{..}= \frac{1}{n^2}\sum_{i,j=1}^{n}d_{ij}$.

The empirical distance correlation: $$dCor_n^{2}(X,Y)=\frac{dCov_n^{2}(X,Y)}{\sqrt{dCov_n^{2}(X,X)dCov_n^{2}(Y,Y)}}$$

I used $dCor_n^{2}$ to measure similarity, and it works better than the Euclidean distance even in the case where the matrices are the same size.
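The empirical formulas above can be sketched as follows (function names `_centered_dists` and `dcor2` are mine; the columns of each matrix are taken as the $n$ realizations in the sums, so both matrices must have the same number of columns):

```python
import numpy as np

def _centered_dists(A):
    # A: (dim, k) -- columns are the k realizations of one random vector.
    # Pairwise Euclidean distances between realizations, then double-center:
    # d_ij - d_i. - d_.j + d_..
    X = A.T                                                      # (k, dim)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # (k, k)
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())

def dcor2(A1, A2):
    # Empirical squared distance correlation dCor_n^2 between the random
    # vectors realized by the columns of A1 (n, k) and A2 (m, k).
    Dx, Dy = _centered_dists(A1), _centered_dists(A2)
    dcov2 = (Dx * Dy).mean()     # dCov_n^2(X, Y)
    dvarx = (Dx * Dx).mean()     # dCov_n^2(X, X)
    dvary = (Dy * Dy).mean()     # dCov_n^2(Y, Y)
    return dcov2 / np.sqrt(dvarx * dvary)
```

Note that the row dimensions of the two matrices enter only through the Euclidean distances, which is what lets this compare matrices with different numbers of rows.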

References:

  1. Josse, J. and Holmes, S. (2013). Measures of dependence between random vectors and tests of independence: literature review. arXiv preprint arXiv:1307.7383. http://arxiv.org/abs/1307.7383.

  2. Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6): 2769-2794.
