Solved – Why do lm and biglm in R give different p-values for the same data

Tags: linear model, p-value, r, regression

Here is a small example:

MyDf <- data.frame(x = c(1, 2, 3, 4), y = c(1.2, 0.7, -0.5, -3))

Now with base::lm (the %>% pipe requires the magrittr package):

> library(magrittr)
> lm(y ~ x, data = MyDf) %>% summary

Call:
lm(formula = y ~ x, data = MyDf)

Residuals:
    1     2     3     4 
-0.47  0.41  0.59 -0.53 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.0500     0.8738   3.491   0.0732 .
x            -1.3800     0.3191  -4.325   0.0495 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7134 on 2 degrees of freedom
Multiple R-squared:  0.9034,    Adjusted R-squared:  0.8551 
F-statistic: 18.71 on 1 and 2 DF,  p-value: 0.04952

Now, try the same thing with biglm from the biglm package:

library(biglm)
XX <- biglm(y ~ x, data = MyDf)
print(summary(XX), digits=5)

Large data regression model: biglm(y ~ x, data = MyDf)
Sample size =  4 
             Coef     (95%      CI)      SE       p
(Intercept)  3.05  1.30243  4.79757 0.87378 0.00048
x           -1.38 -2.01812 -0.74188 0.31906 0.00002

Note that we need print with the digits argument to see the p-values. The coefficients and standard errors are the same, but the p-values are very different. Why is this so?

Best Answer

To see which p-values are correct (if either), let's repeat the calculation for simulated data in which the null hypothesis is true. In the present setting, the calculation is a least-squares fit to (x,y) data and the null hypothesis is that the slope is zero. In the question there are four x values 1,2,3,4 and the estimated residual standard error is around 0.7, so let's incorporate those in the simulation.

Here is the setup, written to be understandable to everyone, even those unfamiliar with R.

beta <- c(intercept=0, slope=0)
sigma <- 0.7
x <- 1:4
y.expected <-  beta["intercept"] + beta["slope"] * x

The simulation generates independent errors, adds them to y.expected, invokes lm to make the fit, and summary to compute the p-values. Although this is inefficient, it is testing the actual code that was used. We can still do thousands of iterations in a second:

n.sim <- 1e3
set.seed(17)
# Each column holds one simulated set of independent errors.
# (beta is zero here, so drawing the errors with mean 0 is equivalent
# to drawing the responses with mean y.expected.)
data.simulated <- matrix(rnorm(n.sim*length(y.expected), 0, sigma), ncol=n.sim)
# Add the errors to y.expected, fit with lm, and extract the slope's p-value.
slope.p.value <- function(e) coef(summary(lm(y.expected + e ~ x)))["x", "Pr(>|t|)"]
p.values <- apply(data.simulated, 2, slope.p.value)

Correctly computed p-values will act like uniform random numbers between $0$ and $1$ when the null hypothesis is true. A histogram of these p-values will allow us to check this visually (does it look roughly horizontal?), and a chi-squared test of uniformity will permit a more formal evaluation. Here's the histogram:

h <- hist(p.values, breaks=seq(0, 1, length.out=20))

[Figure 1: histogram of the simulated p-values, roughly uniform across (0, 1)]

and, for those who might imagine this isn't sufficiently uniform, here's the chi-squared test:

chisq.test(h$counts)

X-squared = 13.042, df = 18, p-value = 0.7891

The large p-value in this test shows these results are consistent with the expected uniformity. In other words, lm is correct.

Where, then, do the differences in p-values come from? Let's check the likely formulas that might be invoked to compute a p-value. In any case the test statistic will be

$$|t| = \left| \frac{\hat\beta - 0}{\operatorname{se}(\hat \beta)}\right|,$$

equal to the discrepancy between the estimated coefficient $\hat \beta$ and the hypothesized (and correct) value $\beta = 0$, expressed as a multiple of the standard error of the coefficient estimate. In the question these values are

$$|t| = \left|\frac{3.05}{0.87378 }\right| = 3.491$$

for the intercept estimate and

$$|t| = \left|\frac{-1.38 }{ 0.31906 }\right| = 4.325$$

for the slope estimate. Ordinarily these would be compared to the Student $t$ distribution whose degrees-of-freedom parameter is $4$ (the number of observations) minus $2$ (the number of estimated coefficients). Let's calculate it for the intercept:

pt(-abs(3.05/0.87378), 4-2) * 2

[1] 0.0732

(This calculation multiplies the left-tailed Student $t$ probability by $2$ because this is a test of $H_0:\beta=0$ against the two-sided alternative $H_A:\beta \ne 0$.) It agrees with the lm output.
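
The same calculation for the slope reproduces the $0.0495$ shown in the lm output:

pt(-abs(-1.38/0.31906), 4-2) * 2

[1] 0.0495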

An alternative calculation would use the standard Normal distribution to approximate the Student $t$ distribution. Let's see what it produces:

pnorm(-abs(3.05/0.87378)) * 2

[1] 0.000482

Sure enough: biglm assumes the null distribution of the $t$ statistic is standard Normal. How much of an error is this? Let's re-run the preceding simulation using biglm in place of lm; a sketch is shown below.
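
Here is one way to do the re-run, reusing data.simulated from above. It assumes, based on the printed summary earlier, that summary(fit)$mat holds the coefficient table with the p-value in its "p" column:

library(biglm)
# Fit biglm to each simulated dataset and extract the slope's p-value.
# summary(fit)$mat is the coefficient matrix printed earlier; its "p"
# column is assumed to hold the p-value.
slope.p.value.biglm <- function(e) {
  dat <- data.frame(x = x, y = y.expected + e)
  summary(biglm(y ~ x, data = dat))$mat["x", "p"]
}
p.values.biglm <- apply(data.simulated, 2, slope.p.value.biglm)
hist(p.values.biglm, breaks = seq(0, 1, length.out = 20))

This produces the following histogram of p-values: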

[Figure 2: histogram of p-values from the biglm simulation, heavily concentrated near 0]

Almost 18% of these p-values are less than $0.05$, a standard threshold of "significance." That's an enormous error.
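
This fraction can be checked directly from the p.values.biglm computed in the sketch above:

mean(p.values.biglm < 0.05)  # close to 0.18 in this simulation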


Some lessons we can learn from this little investigation are:

  1. Do not use approximations derived from asymptotic analyses (like the standard Normal distribution) with small datasets (see the comparison after this list).

  2. Know your software.
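
To quantify the first lesson, compare the two-sided p-value for the slope's $|t| = 4.325$ under Student $t$ distributions of increasing degrees of freedom with the Normal approximation; the degrees of freedom chosen here are purely illustrative:

t.stat <- 4.325
sapply(c(2, 5, 10, 30, 100), function(d) pt(-t.stat, d) * 2)  # Student t p-values
pnorm(-t.stat) * 2  # Normal approximation: the t p-values shrink toward this as df grows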
