Solved – Alternative to Chi-squared test to check if categorical distributions in two sets are the same

binomial distribution, categorical data, chi-squared-test, distributions, equivalence

I have expected frequencies in each category as shown below:

[Two images in the original post show tables of the categories and their counts.]

So my initial data consist of categories and the number of observations in each.
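
For concreteness, here is a purely hypothetical sketch of that layout (the real counts are only in the images above; the DataFrame name store and the column answered are assumptions borrowed from the code further down):

    import pandas as pd

    # Hypothetical counts, for illustration of the layout only: one row per category
    store = pd.DataFrame({
        "category": ["A", "B", "C", "D"],
        "answered": [47000, 1500, 12000, 3900],   # counts in the big data set
        "answered_small": [3700, 120, 950, 310],  # counts in the smaller data set
    })
    print(store)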

There are too many observations, which is a problem because:

As the sample size grows the null hypothesis will be rejected and the
p value goes to zero for any small but nonzero deviation from the null
hypothesis. With counts, i.e. total number of observations, of more
than 50,000 the proper hypothesis test will most likely reject even
small differences that are statistically significant but irrelevant in
applications.
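
To illustrate that point with a hypothetical simulation (the proportions below are made up, not the real data): a deviation of 0.0005 from the null proportion is practically irrelevant, yet the p value collapses once the sample is large enough.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    p_null, p_true = 0.0233, 0.0238   # tiny, practically irrelevant deviation
    for n in (1_000, 10_000, 100_000, 1_000_000):
        k = rng.binomial(n, p_true)   # simulated count for one category
        # newer SciPy exposes stats.binomtest; older versions use stats.binom_test
        print(n, stats.binomtest(k, n, p_null, alternative="two-sided").pvalue)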

Anyway, I applied a Chi-squared test to this K×2 contingency table and got p_value = 6.3723954051318158e-126. This is not so bad, considering that applying this test to completely unrelated data sets gives a p value of essentially zero.
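
For reference, this is roughly how such a K×2 chi-squared test is run in SciPy (the counts below are hypothetical stand-ins for the real tables):

    import numpy as np
    from scipy import stats

    # Hypothetical K x 2 table: counts per category in the big and the small data set
    table = np.array([[47000, 3700],
                      [ 1500,  120],
                      [12000,  950],
                      [ 3900,  310]])
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    print(chi2, p_value, dof)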

As far as I know, using an F-test would give the same result.

Another approach that comes to mind is applying a binomial test. I did it this way:

    from scipy import stats

    # 1500: observed count; store.answered.sum(): total observations; 0.0233: expected proportion of that label
    stats.binom_test(1500, store.answered.sum(), 0.0233, alternative='two-sided')

P_value = 0.00023472778370252812. The result is better, because we don't want to reject the null hypothesis. However, there is another point that we have to keep in mind:

One of the underlying assumptions is that all observations are drawn
independently from the same distribution. This will not hold if there
is correlation within a store or heterogeneity in the
probabilities/distribution. In those cases the variance assumption of
the multinomial/binomial/Poisson model would not hold and we get
either under or over dispersion

Generally I can make such an assumption, but I'm not sure.

So my question is: how can we check that the distributions within these data sets are the same? My final goal is to check that the second, smaller data set is not shifted (the English term may be different) relative to the bigger, original one.

Best Answer

Tests for equivalence test the null hypothesis that quantities differ by at least a threshold of relevance (the smallest value that researchers, or regulators such as the FDA, consider to be meaningful), and rejection of this null hypothesis supports the conclusion that the quantities are equivalent within the bounds of the relevance threshold.1
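
Concretely, for a single category proportion $p$ compared to a reference value $p_{0}$ (the setup in the question), the standard equivalence formulation with a relevance threshold $\varepsilon$ swaps the usual hypotheses around:

$$H_{0}\colon |p - p_{0}| \ge \varepsilon \qquad \text{versus} \qquad H_{1}\colon |p - p_{0}| < \varepsilon$$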

One form of equivalence test is the two one-sided tests (TOST) approach, where (typically) two one-sided t or z tests are constructed around the relevance threshold in the upper and lower directions; rejecting both one-sided tests implies that the true value should be inferred to lie within the equivalence range. However, TOST, while relatively straightforward to compute and widely used, sacrifices an accurate accounting of power to reject by ignoring the non-centrality parameters that come into play in its test statistics. By contrast, uniformly most powerful (UMP) tests for equivalence account for these, and provide optimal statistical power to reject equivalence null hypotheses.
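
As a rough illustration only (this is a plain normal-approximation TOST for a single proportion, not the UMP test discussed next; the margin delta and all counts are hypothetical):

    import numpy as np
    from scipy import stats

    def tost_one_proportion(count, nobs, p0, delta):
        """TOST for equivalence of a proportion to p0 within a margin of +/- delta."""
        phat = count / nobs
        # lower test: H0: p <= p0 - delta  vs  H1: p > p0 - delta
        p_lo = p0 - delta
        pval_lo = stats.norm.sf((phat - p_lo) / np.sqrt(p_lo * (1 - p_lo) / nobs))
        # upper test: H0: p >= p0 + delta  vs  H1: p < p0 + delta
        p_hi = p0 + delta
        pval_hi = stats.norm.cdf((phat - p_hi) / np.sqrt(p_hi * (1 - p_hi) / nobs))
        # equivalence is concluded only if BOTH one-sided tests reject
        return max(pval_lo, pval_hi)

    # e.g. 1500 observations of the label out of 64400 total, reference 0.0233, margin 0.005
    print(tost_one_proportion(1500, 64400, 0.0233, 0.005))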

Chapter 9, section 9.2 of Wellek's Testing Statistical Hypotheses of Equivalence and Noninferiority, Second Edition provides a uniformly most powerful test for equivalence for contingency table $\chi^{2}$ tests (or a test for 'collapsibility', as the contingency table equivalence testing literature has it). The math for constructing the UMP contingency table test statistic is a tad hairy (by which I mean I haven't learned it yet :), but Wellek includes an R macro for the test and an example application.

Finally, I will note that only testing for difference, or only testing for equivalence, implies (without explicit a priori power analysis and justification of a minimum relevant effect size) committing to confirmation bias by privileging the direction of evidence/burden of proof. A savvy way to counter that commitment in a frequentist analytic context is to conduct both tests for difference and tests for equivalence, and draw conclusions accordingly (see the [tost] tag info page for more details on this point).

1 Relevance thresholds can be asymmetric: closer to 'no difference' in one direction than the other.