Here is a small example:
MyDf<-data.frame(x=c(1,2,3,4), y=c(1.2, .7, -.5, -3))
Now with base::lm:
> library(magrittr) # provides the %>% pipe
> lm(y ~ x, data = MyDf) %>% summary
Call:
lm(formula = y ~ x, data = MyDf)
Residuals:
1 2 3 4
-0.47 0.41 0.59 -0.53
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0500 0.8738 3.491 0.0732 .
x -1.3800 0.3191 -4.325 0.0495 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7134 on 2 degrees of freedom
Multiple R-squared: 0.9034, Adjusted R-squared: 0.8551
F-statistic: 18.71 on 1 and 2 DF, p-value: 0.04952
Now, try the same thing with biglm from the biglm package:
library(biglm)
XX <- biglm(y ~ x, data = MyDf)
print(summary(XX), digits = 5)
Large data regression model: biglm(y ~ x, data = MyDf)
Sample size = 4
Coef (95% CI) SE p
(Intercept) 3.05 1.30243 4.79757 0.87378 0.00048
x -1.38 -2.01812 -0.74188 0.31906 0.00002
Note that we need the explicit print with digits to see the p-values. The coefficients and standard errors are the same, but the p-values are very different. Why is this so?
Best Answer
To see which p-values are correct (if either), let's repeat the calculation for simulated data in which the null hypothesis is true. In the present setting, the calculation is a least-squares fit to (x,y) data and the null hypothesis is that the slope is zero. In the question there are four x values 1,2,3,4 and the estimated error is around 0.7, so let's incorporate that in the simulation.
Here is the setup, written to be understandable to everyone, even those unfamiliar with R.
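The setup code was not reproduced above; the following is a minimal reconstruction (the names beta and sigma are assumptions; the zero slope, the four x values, and the 0.7 error SD come from the discussion above):

# Null-hypothesis setup: zero intercept and slope, error SD about 0.7
beta <- c(intercept = 0, slope = 0)
sigma <- 0.7
x <- 1:4
y.expected <- beta["intercept"] + beta["slope"] * x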
The simulation generates independent errors, adds them to y.expected, invokes lm to make the fit, and summary to compute the p-values. Although this is inefficient, it is testing the actual code that was used, and we can still do thousands of iterations in a second.

Correctly computed p-values will act like uniform random numbers between $0$ and $1$ when the null hypothesis is true. A histogram of these p-values will allow us to check this visually (does it look roughly horizontal?) and a chi-squared test of uniformity will permit a more formal evaluation.
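The simulation code itself was elided; here is a minimal sketch consistent with that description (the iteration count, the seed, and the replicate idiom are assumptions):

# Fit lm to simulated null data many times and keep the slope p-values
set.seed(17)
n.iter <- 1e4
p.values <- replicate(n.iter, {
  y <- y.expected + rnorm(length(x), sd = sigma)   # independent errors
  coef(summary(lm(y ~ x)))["x", "Pr(>|t|)"]        # p-value for the slope
})
hist(p.values, breaks = seq(0, 1, by = 0.05),
     main = "lm p-values under the null", xlab = "p-value")

The histogram it draws is indeed roughly horizontal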
and, for those who might imagine this isn't sufficiently uniform, here's the chi-squared test:
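One way to carry out that test (binning into 20 equal-width classes is an assumption):

# Compare binned counts of the p-values to a uniform expectation
counts <- table(cut(p.values, breaks = seq(0, 1, by = 0.05)))
chisq.test(counts)   # default null: equal probability in every bin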
The large p-value in this test shows these results are consistent with the expected uniformity. In other words, lm is correct.

Where, then, do the differences in p-values come from? Let's check the likely formulas that might be invoked to compute a p-value. In any case the test statistic will be
$$|t| = \left| \frac{\hat\beta - 0}{\operatorname{se}(\hat \beta)}\right|,$$
equal to the discrepancy between the estimated coefficient $\hat \beta$ and the hypothesized (and correct) value $\beta = 0$, expressed as a multiple of the standard error of the coefficient estimate. In the question these values are
$$|t| = \left|\frac{3.05}{0.87378 }\right| = 3.491$$
for the intercept estimate and
$$|t| = \left|\frac{-1.38 }{ 0.31906 }\right| = 4.325$$
for the slope estimate. Ordinarily these would be compared to the Student $t$ distribution whose degrees of freedom parameter is $4$ (the amount of data) minus $2$ (the number of estimated coefficients). Let's calculate it for the intercept:
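In base R this is a one-liner with pt, using the estimate and standard error printed above:

# Two-sided p-value from the Student t distribution with 4 - 2 = 2 df
2 * pt(-abs(3.0500 / 0.8738), df = 2)   # 0.0732, matching lm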
(This calculation multiplies the left-tailed Student $t$ probability by $2$ because this is a test of $H_0:\beta=0$ against the two-sided alternative $H_A:\beta \ne 0$.) It agrees with the lm output.

An alternative calculation would use the standard Normal distribution to approximate the Student $t$ distribution. Let's see what it produces:
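Again in base R, with the values from the question:

# Two-sided p-values from the standard Normal approximation
2 * pnorm(-abs(3.0500 / 0.87378))    # intercept: about 0.00048
2 * pnorm(-abs(-1.3800 / 0.31906))   # slope: about 0.00002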
Sure enough: biglm assumes the null distribution of the $t$ statistic is standard Normal. How much of an error is this? Re-running the preceding simulation using biglm in place of lm gives a histogram of p-values that is far from uniform: almost 18% of them are less than $0.05$, a standard threshold of "significance." That's an enormous error.
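For completeness, a sketch of that re-run (reading the p-value out of summary(fit)$mat is an assumption about biglm's internals; column 5 matches the p column in the printed summary above, but verify before relying on it):

# Repeat the simulation with biglm in place of lm
library(biglm)
set.seed(17)
sim.df <- data.frame(x = x, y = y.expected)
p.values.big <- replicate(n.iter, {
  sim.df$y <- y.expected + rnorm(length(x), sd = sigma)
  fit <- biglm(y ~ x, data = sim.df)
  summary(fit)$mat[2, 5]        # row 2 = slope, column 5 = p (assumed)
})
mean(p.values.big < 0.05)       # close to 0.18, not the nominal 0.05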
Some lessons we can learn from this little investigation are:

- Do not use approximations derived from asymptotic analyses (like the standard Normal distribution) with small datasets.
- Know your software.