Solved – How to perform a chi-square test for independence on signal samples

categorical-data, chi-squared-test, histogram, independence, MATLAB

Let's say I have two signals $x$ and $y$, sampled $N$ times, i.e.

$$ x = [ x_{1}, x_{2}, …, x_{N} ] $$
$$ y = [ y_{1}, y_{2}, …, y_{N} ] $$

I would like to test whether $x$ and $y$ are statistically independent at a given significance level.

I have been looking into the chi-squared test for independence. However, since that test applies to categorical data, I do not know how to apply it to my continuous signal samples.

As was suggested in this related question, once we compute a histogram for each signal we do have categories to which a chi-squared test could be applied. But how do we use the histograms to build the required contingency table?

For what it's worth, I am currently computing the histograms using this code:

n1 = hist(x);          % 1-D bin counts for x (10 bins by default)
n2 = hist(y);          % 1-D bin counts for y
n3 = hist3([x' y']);   % joint 2-D bin counts for (x, y) pairs

Thank you for your help and suggestions.

EDIT

As an example, two sampled signals could be generated as follows:

xx = 0.2:0.2:34;        % sample points
x = sin(xx);            % deterministic sinusoid
y = randn(size(xx));    % independent Gaussian noise

Best Answer

I agree with @rolando2: if your data are continuous, a chi-squared test is not really the best choice. I would make a scatterplot and overlay a loess line on it. I don't really know MATLAB, but googling led me to these two pages. I can tell you that in R the code would be plot(x,y); lines(lowess(y~x)). It is true that variables can be dependent while being uncorrelated, because correlation (specifically, Pearson's product-moment correlation) measures only linear dependence; a perfect parabola, or a sine wave running perfectly horizontally, can be uncorrelated with the variable it is a function of despite being completely dependent on it. But you would be able to recognize from the scatterplot that the correlation statistic was failing to capture the dependence, and the smoothed fit would make this even easier to see.
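
A rough MATLAB counterpart of that R one-liner might look like the following. This is only a sketch: smooth(...,'loess') comes from the Curve Fitting Toolbox, and the 0.3 span is an arbitrary choice.

[xs, order] = sort(x);                    % sort so the smoothed line runs left to right
ys = y(order);
scatter(xs, ys, 10, 'filled');            % raw (x, y) pairs
hold on
plot(xs, smooth(xs, ys, 0.3, 'loess'), 'r', 'LineWidth', 1.5);   % loess fit over 30% spans
hold off
xlabel('x'); ylabel('y');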

On the other hand, what the histograms give you depends on how the bins are chosen and how well those bins capture the underlying continuous variables. Since you have the underlying continuous variables, why work with a coarse approximation of them? If you wanted to do this out of purely academic curiosity, you would dice each variable up into bins somehow (perhaps based on theory, perhaps using your software's automatic binning algorithm), treating each variable separately and ignoring the other. You would then cross-tabulate the data by counting the number of cases that fall into each 2-D bin; as a rule of thumb you want at least 5 expected cases in each cell, although the real requirement is more involved than that. Finally, you would run a standard chi-squared test for independence on that contingency table, as sketched below. For the record, I don't see any value in pursuing this course.
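
Out of that same academic curiosity, a minimal MATLAB sketch of the bin-and-test procedure might look like this (assuming histcounts2, available since R2015b, and chi2cdf from the Statistics and Machine Learning Toolbox; the choice of 5 bins per signal is arbitrary):

nbins = 5;                                  % coarse bins help keep expected counts >= 5
O = histcounts2(x, y, [nbins nbins]);       % observed counts: joint 2-D histogram
E = sum(O,2) * sum(O,1) / sum(O(:));        % expected counts under independence
chi2 = sum((O(:) - E(:)).^2 ./ E(:));       % Pearson chi-squared statistic
df = (size(O,1) - 1) * (size(O,2) - 1);     % degrees of freedom
p = 1 - chi2cdf(chi2, df);                  % small p suggests dependence

If the Statistics and Machine Learning Toolbox is available, binning each signal (for example with discretize) and passing the bin indices to crosstab would perform the same kind of test in a single call.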
