Solved – Do the properties of Pearson’s chi-squared test for independence hold true for continuous PDFs

Tags: chi-squared-test, continuous data, density function, independence, p-value

In statistics, Pearson's chi-squared test for discrete (categorical) data is built on the statistic:

\begin{aligned}
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} {(O_{i,j} - E_{i,j})^2 \over E_{i,j}}
\end{aligned}

Where $\chi^2$ is our (Pearson's) cumulative test statistic, $O_{i,j}$ is an observed frequency in a given contingency table,
$E_{i,j}$ is the expected (theoretical) frequency for the same cell, and $r$ and $c$ are the number of rows and columns in the table, respectively.

Given this, Pearson's chi-squared test for independence defines the expected frequencies as:

\begin{aligned}
E_{i,j}=\frac{\sum_{k=1}^c O_{i,k}\ \sum_{k=1}^r O_{k,j}}{N}
\end{aligned}

… which is a mathematically terse way of saying that the expected frequency in each cell is the product of its row total and column total, divided by the grand total $N$.
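For concreteness, here is a minimal sketch (the table values are made up) that computes $E$ and $\chi^2$ directly from the two formulas above and cross-checks them against SciPy's implementation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table of observed frequencies O.
O = np.array([[20, 30, 25],
              [30, 20, 25]])

N = O.sum()
row_totals = O.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = O.sum(axis=0, keepdims=True)  # shape (1, 3)

# E_{i,j} = (row i total) * (column j total) / N, via broadcasting.
E = row_totals * col_totals / N

# Pearson's statistic: sum of (O - E)^2 / E over all cells.
chi2_stat = ((O - E) ** 2 / E).sum()

# Cross-check against scipy.stats.chi2_contingency.
stat, p, dof, expected = chi2_contingency(O)
np.testing.assert_allclose(E, expected)
print(chi2_stat, stat)  # the two statistics agree
```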

I'm currently looking at generalizing this test to continuous variables that can be estimated by a PDF. Phrased differently: given the joint probability distribution of two variables, I would like to be able to perform a chi-squared test for independence.

My intuition is that, for the continuous case, the first formula can be generalized as:

\begin{aligned}
\chi^2 = \int_I \int_J {(\hat{O_{i,j}} - \hat{E_{i,j}})^2 \over \hat{E_{i,j}}}\ di\ dj
\end{aligned}

Where $\chi^2$ remains our cumulative test statistic, $\hat{O_{i,j}}$ is an observed frequency read off the joint probability distribution at $i \in I$ and $j \in J$,
$\hat{E_{i,j}}$ remains the expected frequency, and $r$ and $c$ briefly disappear. We then define $\hat{E_{i,j}}$ by a comparable expression that looks something like:

\begin{aligned}
\hat{E_{i,j}}=\frac{\int_c \hat{O_{i,c}}\ dc\ \int_r \hat{O_{r,j}}\ dr}{2}
\end{aligned}

… where $r$ and $c$ are the partitions selected during integration by, for example, an adaptive quadrature. Note further that this relationship breaks down over infinite bounds, as $r$ or $c$ approaches infinity (or, equivalently, as any evaluation of $\hat{E_{i,j}}$ approaches zero).

This is intuitive only in the sense that these calculations are made practical by tools like SciPy, and I'm leaving out a lot of detail so as not to bore or confuse the reader.
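To make that concrete, here is a minimal sketch of the proposed statistic, assuming a Gaussian KDE for $\hat{O_{i,j}}$ (all data and grid choices below are made up for illustration). It takes $\hat{E_{i,j}}$ to be the product of the two marginal densities, the direct analogue of the discrete row-by-column product (a density already integrates to one, so no further normalizer plays the role of $N$), and it truncates the infinite bounds and guards against $\hat{E_{i,j}} \to 0$, the two breakdowns noted above:

```python
import numpy as np
from scipy import integrate, stats

# Hypothetical paired sample (values made up); y depends on x,
# so independence should be rejected.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)

# Gaussian-kernel estimate of the joint density, playing the
# role of O-hat(i, j).
kde = stats.gaussian_kde(np.vstack([x, y]))

# Evaluate on a finite grid: the infinite bounds cautioned about
# above are truncated to [-5, 5] here.
grid = np.linspace(-5.0, 5.0, 200)
di = dj = grid[1] - grid[0]
I, J = np.meshgrid(grid, grid, indexing="ij")
o_hat = kde(np.vstack([I.ravel(), J.ravel()])).reshape(I.shape)

# Marginal densities, each obtained by integrating out the
# other variable.
f_i = integrate.trapezoid(o_hat, grid, axis=1)
f_j = integrate.trapezoid(o_hat, grid, axis=0)

# E-hat(i, j) as the product of the marginals: the continuous
# analogue of (row total * column total) / N.
e_hat = np.outer(f_i, f_j)

# The proposed statistic, skipping cells where e_hat ~ 0 (the
# breakdown noted in the question).
mask = e_hat > 1e-12
chi2_stat = np.sum((o_hat[mask] - e_hat[mask]) ** 2 / e_hat[mask]) * di * dj
print(chi2_stat)
```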

But I am curious: is this method, as tersely presented, correct? I can find no guidance contradicting it. The goal is to make the test work for terms expressible as continuous PDFs smoothed by a Gaussian kernel, built from unorganized ordinal and nominal data.

Thank you in advance.

Best Answer

In general, the Pearson chi-squared test doesn't make sense for continuous data, because continuous data usually carry a separate dispersion parameter: the expected value in the denominator of the statistic relies on the mean-variance relationship of categorical data. If you're looking for a generalized version of the Pearson chi-squared test statistic, you must specify which attribute of the statistic interests you. Because it is the score test for a logistic regression model, you could consider a score test for any other regression model, such as linear regression, or a different test statistic, such as the Wald or likelihood-ratio test. This addresses the inferential aspect of the test: whether the row and column variables are independent.
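As a minimal sketch of the likelihood-ratio route, assuming statsmodels and made-up binary data (neither is specified in the answer): the statistic is twice the log-likelihood gap between the logistic model with and without the row variable, referred to a chi-squared distribution.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Hypothetical binary row/column variables (values made up);
# y is more likely to be 1 when x is 1, so they are dependent.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=300)
y = (rng.random(300) < np.where(x == 1, 0.6, 0.4)).astype(int)

# Full logistic model y ~ 1 + x versus the null model y ~ 1.
full = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
null = sm.Logit(y, np.ones((300, 1))).fit(disp=0)

# Likelihood-ratio statistic; chi-squared with 1 df under
# independence of x and y.
lr = 2 * (full.llf - null.llf)
print(lr, chi2.sf(lr, df=1))
```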

Alternatively, you may be interested in this test statistic as a measure of goodness of fit or calibration for a predictive model. If you can sensibly bin the data into distinct groups, you can still use the chi-squared test statistic, or a kappa statistic measuring agreement between observed and predicted data, which is especially useful when the outcome has more than two distinct groups. I would prefer staying on the continuous scale and using the MSE as a measure of predictive accuracy, preferably the cross-validated MSE of your predictive model. This can also be standardized into a measure like $R^2$, which has associated significance tests as well.
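A minimal sketch of that preferred route, assuming scikit-learn and an ordinary linear model (both are my choices for illustration; the answer names neither):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical continuous predictor and outcome (values made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 0.5 * X.ravel() + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# Cross-validated MSE as the measure of predictive accuracy.
mse = -cross_val_score(model, X, y,
                       scoring="neg_mean_squared_error", cv=5).mean()

# Cross-validated R^2, the standardized counterpart.
r2 = cross_val_score(model, X, y, scoring="r2", cv=5).mean()
print(f"CV MSE: {mse:.3f}  CV R^2: {r2:.3f}")
```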
