Solved – How to compare joint distribution to product of marginal distributions

density function, histogram, joint distribution, MATLAB, probability

I have two finite-sampled signals, $x_1$ and $x_2$, and I want to check for statistical independence.

I know that for two statistically independent signals, their joint probability distribution is a product of the two marginal distributions.

I have been advised to use histograms in order to approximate the distributions. Here's a small example.

x1 = rand(1, 50);
x2 = randn(1, 50);
n1 = hist(x1);
n2 = hist(x2);
n3 = hist3([x1' x2']);

Since I am using the default number of bins, n1 and n2 are 10-element vectors, and n3 is a 10×10 matrix.

My question is this: How do I check whether n3 is in fact a product of n1 and n2?

Do I use an outer product? And if I do, should I use x1'*x2 or x1*x2'? And why?

Also, I have noticed that hist returns the number of elements (frequency) in each bin. Should this be normalized in any way? (I haven't quite understood how hist3 works either.)

Thank you very much for your help. I'm really new to statistics so some explanatory answers would really help.

Best Answer

Assuming that the theoretical distributions of $x_1$ and $x_2$ are not known, a naive algorithm for determining independence would be as follows:

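Define $x_{1,2}$ to be the set of all co-occurrences of values from $x_1$ and $x_2$, that is, the pairs of values observed at the same sample index. For example, if $x_1 = \{1, 2, 2\}$ and $x_2 = \{3, 6, 5\}$, the set of co-occurrences would be $\{(1,3), (2,6), (2,5)\}$. (Pairing every value of $x_1$ with every value of $x_2$ would not work, because the resulting joint estimate would equal the product of the marginals by construction.)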

  1. Estimate the probability density functions (PDFs) of $x_1$, $x_2$ and $x_{1,2}$, denoted as $P_{x_1}$, $P_{x_2}$ and $P_{x_{1,2}}$.
  2. Compute the root-sum-square error $y = \sqrt{\sum_{(y_1,y_2)} \left( P_{x_{1,2}}(y_1,y_2) - P_{x_1}(y_1) \, P_{x_2}(y_2) \right)^2}$, where $(y_1,y_2)$ takes the values of each pair in $x_{1,2}$.
  3. If $y$ is close to zero, this suggests that $x_1$ and $x_2$ are independent.

A simple way to estimate a PDF from a sample is to compute the sample's histogram and then normalize it so that the estimated PDF integrates to 1. In practice, that means dividing the bin counts of the histogram by the factor $h \cdot \mathrm{sum}(n)$, where $h$ is the bin width and $n$ is the histogram vector. For the two-dimensional histogram, the corresponding factor is the bin area (the product of the two bin widths) times the total count.
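As a rough illustration of this normalization, here is a minimal sketch based on the variables from the question. It assumes the Statistics Toolbox is available for hist3, and it takes the marginal counts as row and column sums of the joint histogram so that all three estimates share the same bins (a choice made for this sketch, not something stated in the question). Because hist3([x1' x2']) puts the bins of x1 along the rows and the bins of x2 along the columns, the matching outer product is a column of x1 values times a row of x2 values. With row-vector counts from hist, that would be n1'*n2, applied to the histograms rather than to the raw samples.

```matlab
% Sketch: normalized histogram estimates and the product of the marginals.
% hist3 requires the Statistics Toolbox.
x1 = rand(1, 50);
x2 = randn(1, 50);

[n3, c] = hist3([x1' x2']);   % n3(i,j): count with x1 in bin i and x2 in bin j
n1 = sum(n3, 2);              % 10x1 marginal counts for x1 (row sums)
n2 = sum(n3, 1);              % 1x10 marginal counts for x2 (column sums)

h1 = c{1}(2) - c{1}(1);       % bin width along x1 (c holds the bin centers)
h2 = c{2}(2) - c{2}(1);       % bin width along x2

P1  = n1 / (h1 * sum(n1));            % marginal PDF estimate of x1
P2  = n2 / (h2 * sum(n2));            % marginal PDF estimate of x2
P12 = n3 / (h1 * h2 * sum(n3(:)));    % joint PDF estimate

Pprod = P1 * P2;              % outer product: (10x1)*(1x10) gives 10x10
```

Pprod has the same size and orientation as P12, so the two can be compared element by element.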

Note that step 3 of this algorithm requires the user to specify a threshold for deciding whether the signals are independent.
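The answer does not say how to pick that threshold. One possible approach, which is an assumption of this sketch and not part of the original algorithm, is to compare $y$ against values obtained after randomly shuffling $x_2$: shuffling keeps both marginals but breaks any pairing with $x_1$. The helper independence_error is a hypothetical name introduced here; it sums the squared difference over all histogram bins rather than over the observed pairs, and it assumes the Statistics Toolbox for hist3. Local functions at the end of a script need MATLAB R2016b or later; otherwise the function can be saved in its own file.

```matlab
% Sketch: error y for the observed pairs versus a shuffled baseline.
x1 = rand(1, 500);
x2 = randn(1, 500);
nbins = 10;

y_obs = independence_error(x1, x2, nbins);

% Baseline: y for data that are independent by construction (x2 shuffled).
nperm  = 200;
y_null = zeros(1, nperm);
for k = 1:nperm
    y_null(k) = independence_error(x1, x2(randperm(numel(x2))), nbins);
end

% Illustrative threshold: roughly the 95th percentile of the baseline values.
s = sort(y_null);
threshold = s(ceil(0.95 * nperm));
looks_independent = y_obs <= threshold;

function y = independence_error(x1, x2, nbins)
% Root-sum-square difference between the joint histogram PDF estimate
% and the product of the marginal estimates, taken over all bins.
    [n3, c] = hist3([x1(:) x2(:)], 'Nbins', [nbins nbins]);
    h1 = c{1}(2) - c{1}(1);          % bin width along x1
    h2 = c{2}(2) - c{2}(1);          % bin width along x2
    N  = sum(n3(:));                 % total number of pairs
    P12 = n3 / (h1 * h2 * N);        % joint PDF estimate
    P1  = sum(n3, 2) / (h1 * N);     % marginal PDF estimate of x1 (column)
    P2  = sum(n3, 1) / (h2 * N);     % marginal PDF estimate of x2 (row)
    D = P12 - P1 * P2;               % difference from the outer product
    y = sqrt(sum(D(:).^2));
end
```

The value of $y$ depends on the number of bins and on the sample size, so a threshold found this way applies only to the same settings.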