Solved – Why do lm and biglm in R give different p-values for the same data

Tags: linear model, p-value, r, regression

Here is a small example:

MyDf <- data.frame(x = c(1, 2, 3, 4), y = c(1.2, 0.7, -0.5, -3))

Now with base::lm (the %>% pipe requires the magrittr package):

> library(magrittr)
> lm(y ~ x, data = MyDf) %>% summary

Call:
lm(formula = y ~ x, data = MyDf)

Residuals:
    1     2     3     4 
-0.47  0.41  0.59 -0.53 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.0500     0.8738   3.491   0.0732 .
x            -1.3800     0.3191  -4.325   0.0495 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7134 on 2 degrees of freedom
Multiple R-squared:  0.9034,    Adjusted R-squared:  0.8551 
F-statistic: 18.71 on 1 and 2 DF,  p-value: 0.04952

Now, try the same thing with biglm from the biglm package:

library(biglm)
XX <- biglm(y ~ x, data = MyDf)
print(summary(XX), digits=5)

Large data regression model: biglm(y ~ x, data = MyDf)
Sample size =  4 
             Coef     (95%      CI)      SE       p
(Intercept)  3.05  1.30243  4.79757 0.87378 0.00048
x           -1.38 -2.01812 -0.74188 0.31906 0.00002

Note that we need print with the digits argument to see the p-values. The coefficients and standard errors are the same, but the p-values are very different. Why is this so?

Best Answer

To see which p-values are correct (if either), let's repeat the calculation for simulated data in which the null hypothesis is true. In the present setting, the calculation is a least-squares fit to (x,y) data and the null hypothesis is that the slope is zero. In the question there are four x values 1,2,3,4 and the estimated residual standard error is around 0.7, so let's incorporate those in the simulation.

Here is the setup, written to be understandable to everyone, even those unfamiliar with R.

beta <- c(intercept=0, slope=0)
sigma <- 0.7
x <- 1:4
y.expected <-  beta["intercept"] + beta["slope"] * x

The simulation generates independent errors, adds them to y.expected, invokes lm to make the fit, and summary to compute the p-values. Although this is inefficient, it is testing the actual code that was used. We can still do thousands of iterations in a second:

n.sim <- 1e3
set.seed(17)
# Each column holds one simulated set of independent errors.
# (beta is zero here, so drawing the errors with mean 0 is equivalent
# to drawing the responses with mean y.expected.)
data.simulated <- matrix(rnorm(n.sim*length(y.expected), 0, sigma), ncol=n.sim)
# Add the errors to y.expected, fit with lm, and extract the slope's p-value.
slope.p.value <- function(e) coef(summary(lm(y.expected + e ~ x)))["x", "Pr(>|t|)"]
p.values <- apply(data.simulated, 2, slope.p.value)

Correctly computed p-values will act like uniform random numbers between $0$ and $1$ when the null hypothesis is true. A histogram of these p-values will allow us to check this visually (does it look roughly horizontal?), and a chi-squared test of uniformity will permit a more formal evaluation. Here's the histogram:

h <- hist(p.values, breaks=seq(0, 1, length.out=20))

[Figure 1: histogram of the simulated p-values, roughly uniform across (0, 1)]

and, for those who might imagine this isn't sufficiently uniform, here's the chi-squared test:

chisq.test(h$counts)

X-squared = 13.042, df = 18, p-value = 0.7891

The large p-value in this test shows these results are consistent with the expected uniformity. In other words, lm is correct.

Where, then, do the differences in p-values come from? Let's check the likely formulas that might be invoked to compute a p-value. In any case the test statistic will be

$$|t| = \left| \frac{\hat\beta - 0}{\operatorname{se}(\hat \beta)}\right|,$$

equal to the discrepancy between the estimated coefficient $\hat \beta$ and the hypothesized (and correct) value $\beta = 0$, expressed as a multiple of the standard error of the coefficient estimate. In the question these values are

$$|t| = \left|\frac{3.05}{0.87378 }\right| = 3.491$$

for the intercept estimate and

$$|t| = \left|\frac{-1.38 }{ 0.31906 }\right| = 4.325$$

for the slope estimate. Ordinarily these would be compared to the Student $t$ distribution whose degrees-of-freedom parameter is $4$ (the number of observations) minus $2$ (the number of estimated coefficients). Let's calculate it for the intercept:

pt(-abs(3.05/0.87378), 4-2) * 2

[1] 0.0732

(This calculation multiplies the left-tailed Student $t$ probability by $2$ because this is a test of $H_0:\beta=0$ against the two-sided alternative $H_A:\beta \ne 0$.) It agrees with the lm output.
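
The same calculation for the slope reproduces the $0.0495$ shown in the lm output:

pt(-abs(-1.38/0.31906), 4-2) * 2

[1] 0.0495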

An alternative calculation would use the standard Normal distribution to approximate the Student $t$ distribution. Let's see what it produces:

pnorm(-abs(3.05/0.87378)) * 2

[1] 0.000482

Sure enough: biglm assumes the null distribution of the $t$ statistic is standard Normal. How much of an error is this? Let's re-run the preceding simulation using biglm in place of lm; a sketch is shown below.
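
Here is one way to do the re-run, reusing data.simulated from above. It assumes, based on the printed summary earlier, that summary(fit)$mat holds the coefficient table with the p-value in its "p" column:

library(biglm)
# Fit biglm to each simulated dataset and extract the slope's p-value.
# summary(fit)$mat is the coefficient matrix printed earlier; its "p"
# column is assumed to hold the p-value.
slope.p.value.biglm <- function(e) {
  dat <- data.frame(x = x, y = y.expected + e)
  summary(biglm(y ~ x, data = dat))$mat["x", "p"]
}
p.values.biglm <- apply(data.simulated, 2, slope.p.value.biglm)
hist(p.values.biglm, breaks = seq(0, 1, length.out = 20))

This produces the following histogram of p-values: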

[Figure 2: histogram of p-values from the biglm simulation, heavily concentrated near 0]

Almost 18% of these p-values are less than $0.05$, a standard threshold of "significance." That's an enormous error.
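
This fraction can be checked directly from the p.values.biglm computed in the sketch above:

mean(p.values.biglm < 0.05)  # close to 0.18 in this simulation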


Some lessons we can learn from this little investigation are:

  1. Do not use approximations derived from asymptotic analyses (like the standard Normal distribution) with small datasets (see the comparison after this list).

  2. Know your software.
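
To quantify the first lesson, compare the two-sided p-value for the slope's $|t| = 4.325$ under Student $t$ distributions of increasing degrees of freedom with the Normal approximation; the degrees of freedom chosen here are purely illustrative:

t.stat <- 4.325
sapply(c(2, 5, 10, 30, 100), function(d) pt(-t.stat, d) * 2)  # Student t p-values
pnorm(-t.stat) * 2  # Normal approximation: the t p-values shrink toward this as df grows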
