I need help understanding the chi-squared independence test in `scipy.stats.chi2_contingency`.

Let's assume I have two samples (of different sizes) of a categorical variable with 3 possible outcomes (1, 2, 3):

Counts of the outcomes for the two samples are as follows:

```
Sample 1: {1: 1000, 2:1000, 3:2000}
Sample 2: {1: 2000, 2:2000, 3:4000}
```

They are clearly dependent and come from the same distribution.

Sample sizes are fairly large, so we should be able to show statistical significance of this.

So, I compute the chi2 statistic and p-value for the test of independence of the observed frequencies in the contingency table.

My contingency table is:

```
[[1000, 1000, 2000],
[2000, 2000, 4000]]
```

and the output of the chi2 independence test (from scipy) for this case is:

```
(0.0,
1.0,
2,
array([[1000., 1000., 2000.],
[2000., 2000., 4000.]]))
```
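For reference, a minimal sketch (assuming a recent SciPy) that reproduces this output:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = samples, columns = outcomes 1, 2, 3
table = np.array([[1000, 1000, 2000],
                  [2000, 2000, 4000]])

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)  # 0.0 1.0 2
print(expected)      # identical to the observed table
```

Because the second row is exactly double the first, the expected frequencies under independence equal the observed frequencies, so the statistic is 0.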

According to Wikipedia:

> For the test of independence, also known as the test of homogeneity, a chi-squared probability of less than or equal to 0.05 (or the chi-squared statistic being at or larger than the 0.05 critical point) is commonly interpreted by applied workers as justification for rejecting the null hypothesis that the row variable is independent of the column variable.[4] The alternative hypothesis corresponds to the variables having an association or relationship where the structure of this relationship is not specified.

So, the p-value is 1.0, which means that we don't reject the null hypothesis that the occurrence of outcomes for the two samples is independent.

Let's consider two other (slightly different) samples:

```
Sample 1: {1: 900, 2: 900, 3: 2100}
Sample 2: {1: 2100, 2: 2100, 3: 3900}
```

For these samples, the contingency table is:

```
[[900, 900, 2100],
[2100, 2100, 3900]]
```

And the chi2 independence test result is now:

```
(34.18803418803419,
3.7684495236693435e-08,
2,
array([[ 975., 975., 1950.],
[2025., 2025., 4050.]]))
```
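The same sketch applied to the second table reproduces this result as well:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[ 900,  900, 2100],
                  [2100, 2100, 3900]])

chi2, p, dof, expected = chi2_contingency(table)
print(chi2)  # ~34.188
print(p)     # ~3.77e-08
```

Note that no Yates continuity correction is applied here: `chi2_contingency` only applies it (by default) to 2x2 tables, and this table is 2x3.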

Now we have a very small p-value, which means that we reject the null hypothesis and conclude that there is a relationship between the occurrence of outcomes for the two samples.

My question is: How is this possible?

It seems I don't understand something with the chi2 independence test.

To me, the first two samples are clearly from the same distribution.

Their p-value for chi2 independence test should be close to 0.

For the second pair of samples, which are similar but not identical, the p-value should be much higher than for the first pair.

My understanding of the chi2 independence test is that the p-value should be smaller if the counts for the two samples are more similar, and higher if there is more mismatch.

Where am I wrong?

For more context: I want to apply the chi2 independence test to a random split into test and control groups, to test whether the two groups are statistically different with respect to a categorical variable.

I generate splits into test and control groups until I find one that has p-value < 0.05 which should prove that the two groups are similar (with 95% significance level).

Maybe chi2 independence test is not the right statistical tool for this?

## Best Answer

You seem to have some confusion about the meaning of independence. Under your first example table, where the second sample merely doubles the counts of the first, you write "They are clearly dependent and come from the same distribution." This statement is only half-right: the samples are drawn from the same distribution, but they are drawn *independently* of each other.

The choice of sample has no bearing on the observed distribution; it is the same in both samples. The distribution is *not* dependent on which sample you observe, therefore the Sample variable is independent of the Outcome variable. The chi-squared p-value of 1 confirms this, by failing to reject the null hypothesis that the row and column variables are independent. In this case, Sample and Outcome are independent: you will not observe a different distribution of Outcomes no matter which Sample you pick.
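To make this concrete, the expected counts under independence are built from the row and column marginals, and for the first table they coincide exactly with the observed counts, which is why the statistic is 0 and the p-value is 1. A small sketch:

```python
import numpy as np

observed = np.array([[1000, 1000, 2000],
                     [2000, 2000, 4000]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)  # [[4000], [8000]]
col_totals = observed.sum(axis=0, keepdims=True)  # [[3000, 3000, 6000]]
expected = row_totals * col_totals / observed.sum()

print(expected)                          # equals the observed table
print(np.allclose(observed, expected))   # True -> chi2 statistic is 0
```

The chi-squared statistic sums (observed - expected)^2 / expected over all cells, so identical proportions in the two rows give a statistic of exactly 0, the strongest possible *failure* to reject independence.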