Solved – Common statistical tests as linear models

anova, correlation, linear model, regression, t-test

(UPDATE: I dived deeper into this and posted the results here)

The list of named statistical tests is huge. Many of the common tests rely on inference from simple linear models, e.g. a one-sample t-test is just $y = \beta + \varepsilon$, which is tested against the null model $y = \mu + \varepsilon$, i.e. that $\beta = \mu$, where $\mu$ is some null value – typically $\mu = 0$.

I find this to be quite a bit more instructive for teaching purposes than rote learning named models, when to use them, and their assumptions as if they had nothing to do with each other. That approach does not promote understanding. However, I cannot find a good resource collecting this. I am more interested in equivalences between the underlying models than in the method of inference from them. Although, as far as I can see, likelihood ratio tests on all these linear models yield the same results as the "classical" inference.

Here are the equivalences I've learned about so far, ignoring the error term $\varepsilon \sim \mathcal N(0, \sigma^2)$ and assuming that all null hypotheses are the absence of an effect:

One-sample t-test:
$y = \beta_0 \qquad \mathcal{H}_0: \beta_0 = 0$.

Paired-sample t-test:
$y_2-y_1 = \beta_0 \qquad \mathcal{H}_0: \beta_0 = 0$

This is identical to a one-sample t-test on pairwise differences.

Two-sample t-test:
$y = \beta_1 * x + \beta_0 \qquad \mathcal{H}_0: \beta_1 = 0$

where x is an indicator (0 or 1).

Pearson correlation:
$y = \beta_1 * x + \beta_0 \qquad \mathcal{H}_0: \beta_1 = 0$

Notice the similarity to a two-sample t-test which is just regression on a binary x-axis.

Spearman correlation:
$rank(y) = \beta_1 * rank(x) + \beta_0 \qquad \mathcal{H}_0: \beta_1 = 0$

This is identical to a Pearson correlation on rank-transformed x and y.
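A sketch of that identity on simulated monotone-but-nonlinear data (scipy's `spearmanr` and a Pearson correlation on `rankdata`-transformed values; with no ties the correlation coefficients agree exactly, and the default p-values agree because both reduce to the same t-statistic with $n-2$ degrees of freedom):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = x**3 + rng.normal(size=50)   # monotone but nonlinear relation

# Classical Spearman correlation
rho, p_spear = st.spearmanr(x, y)

# Pearson correlation on the rank-transformed data
r_ranks, p_ranks = st.pearsonr(st.rankdata(x), st.rankdata(y))
```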

One-way ANOVA:
$y = \beta_1*x_1 + \beta_2*x_2 + \beta_3*x_3 +… \qquad \mathcal{H}_0: \beta_1, \beta_2, \beta_3, … = \beta$

where $x_i$ are indicators selecting the relevant $\beta$ (one $x$ is 1; the others are 0). In matrix form this is $y = X\beta$.
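Numerically, the overall F-test of a dummy-coded regression reproduces the one-way ANOVA exactly. A sketch with three simulated groups (illustrative data; statsmodels' formula interface builds the indicator columns via `C(g)`):

```python
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
groups = {"a": rng.normal(0.0, 1, 20),
          "b": rng.normal(0.5, 1, 25),
          "c": rng.normal(1.0, 1, 22)}

# Classical one-way ANOVA
F_classic, p_classic = st.f_oneway(*groups.values())

# Linear model with dummy-coded group indicators; the overall F-test
# (all group coefficients equal) is the same test
df = pd.DataFrame({
    "y": np.concatenate(list(groups.values())),
    "g": np.repeat(list(groups.keys()), [20, 25, 22]),
})
fit = smf.ols("y ~ C(g)", data=df).fit()
```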

Two-way ANOVA:
$y = \beta_1 * X_1 + \beta_2 * X_2 + \beta_3 * X_1 * X_2 \qquad \mathcal{H}_0: \beta_3 = 0$

for two two-level factors. Here $\beta_i$ are vectors of betas where one is selected by the indicator vector $X_i$. The $\mathcal{H}_0$ shown here is the interaction effect.
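For the two-level case this can be sketched as follows (simulated balanced data of my own; since the interaction has a single degree of freedom, its ANOVA F equals the squared t of the interaction coefficient in the linear model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(6)
n = 40  # observations per cell
a = np.repeat([0, 1], 2 * n)
b = np.tile(np.repeat([0, 1], n), 2)
y = 0.5 * a + 0.3 * b + 0.7 * a * b + rng.normal(size=4 * n)
df = pd.DataFrame({"y": y, "a": a, "b": b})

# Linear model with main effects and interaction
fit = smf.ols("y ~ C(a) * C(b)", data=df).fit()
table = anova_lm(fit, typ=2)

# The interaction's ANOVA F vs. the interaction coefficient's t
F_inter = table.loc["C(a):C(b)", "F"]
t_inter = fit.tvalues["C(a)[T.1]:C(b)[T.1]"]
```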

Could we add more "named tests" to this list of linear models? E.g., multivariate regression, other "non-parametric" tests, binomial tests, or RM-ANOVAs?

UPDATE: questions have been asked and answered about ANOVA and t-tests as linear models here on SO. See this question and tagged related questions.

Best Answer

This is not an exhaustive list, but if you include generalized linear models, the scope of this problem becomes substantially larger.

For instance:

The Cochran-Armitage test of trend can be formulated by: $$E[\mbox{logit} (p) | t] = \beta_0 + \beta_1 t \qquad \mathcal{H}_0: \beta_1 = 0$$

The Pearson Chi-Square test of independence for a $p \times k$ contingency table is a log-linear model for the cell frequencies given by:

$$E[\log (\mu)] = \beta_0 + \beta_{i.} + \beta_{.j} + \gamma_{ij} \quad i,j > 1 \qquad\mathcal{H}_0: \gamma_{ij} = 0, \quad i,j > 1$$

Also, the t-test for unequal variances (Welch's t-test) is well approximated by OLS with Huber-White (heteroskedasticity-robust) standard error estimation.
