Regression Analysis – Why Linear Regression Gives a Good R-Squared but High BIC

Tags: bic, r-squared, regression

I tried a linear model with interactions on my data ($n=95,840$ rows):

 $ NUM_SITE : Factor w/ 9 levels "8647","8666",..: 3 4 5 8 9 1 2 3 4 6 ...
 $ TEMPERATURE_AIR            : num  6.29 6.13 6 7.05 8.16 ...
 $ MonthNumber                : Factor w/ 12 levels "1","2","3","4",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ HOURS                      : Factor w/ 24 levels "0","1","2","3",..: 7 7 7 7 7 8 8 8 8 8 ...
 $ TEMPERATURE_COUPON         : num  5.1 6.6 4.5 5.4 4.7 ...


reg_sencrop_interactions_all_coupon_without8668 <- lm(
  TEMPERATURE_COUPON ~ TEMPERATURE_AIR * MonthNumber * HOURS,
  data = data
)

I'm surprised by the BIC score compared to the $R^2$. My $R^2$ is good:

summary(reg_sencrop_interactions_all_coupon_without8668)$r.squared
[1] 0.9220565

But my BIC is high:

BIC(reg_sencrop_interactions_all_coupon_without8668)
[1] 443093.3

How can this big difference be explained?

Best Answer

Suppose you have a Gaussian regression model with $k$ model terms and you observe $n$ data points to fit the model. Let $R^2$ denote the coefficient-of-determination for the regression and let $s_Y^2$ denote the sample variance of the response variable. In a related answer I show that the Bayesian Information Criterion (BIC) can be written in terms of the coefficient-of-determination and these other quantities as:

$$\text{BIC} = n \Bigg[ 1+\ln (2 \pi) + \frac{k\ln(n)}{n} + \ln \bigg( 1-\frac{1}{n} \bigg) + \ln (1-R^2) + \ln (s_Y^2) \Bigg].$$
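As a sanity check, this identity is easy to verify numerically. Here is a minimal sketch in R on simulated toy data (not the questioner's dataset), taking $k$ as the total number of estimated parameters, i.e. the regression coefficients plus the error variance, which is the convention used by R's built-in BIC():

## Verify the identity against R's built-in BIC() on simulated data
set.seed(1)
n   <- 500
x   <- rnorm(n)
y   <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

k   <- length(coef(fit)) + 1   # coefficients plus the error variance
r2  <- summary(fit)$r.squared
s2y <- var(y)                  # sample variance of the response

bic_formula <- n * (1 + log(2 * pi) + k * log(n) / n +
                      log(1 - 1 / n) + log(1 - r2) + log(s2y))
bic_formula - BIC(fit)   # ~0, up to floating-point error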

In particular, when $n$ is large you have:

$$\text{BIC} \approx n \Bigg[ 1 + \ln (2 \pi) + \ln (1-R^2) + \ln (s_Y^2) \Bigg].$$
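Continuing the toy sketch above, the dropped terms are already negligible relative to the total at moderate sample sizes:

## The omitted terms are k*log(n) and n*log(1 - 1/n)
bic_approx <- n * (1 + log(2 * pi) + log(1 - r2) + log(s2y))
bic_formula - bic_approx   # small next to the full BIC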

(In the present case you have a regression model that includes an intercept term, so I have simplified the total degrees-of-freedom in the linked expression.) We can see from this equation that a higher coefficient-of-determination leads to a lower BIC, as we would expect. However, there are a number of other terms in the expression that can also make the BIC "big". In particular, the BIC is a scale-dependent measure: it depends on the variance of the response through $\ln(s_Y^2)$, and it is roughly proportional to the sample size $n$.
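The scale dependence is easy to demonstrate with the same toy fit: rescaling the response (say, measuring temperature in millidegrees instead of degrees) leaves $R^2$ unchanged but shifts the BIC by $n \ln(c^2)$ for a scale factor $c$:

## Rescale the response by c = 1000: R^2 is scale-free, BIC is not
fit_scaled <- lm(I(1000 * y) ~ x)
summary(fit_scaled)$r.squared - summary(fit)$r.squared   # ~0
BIC(fit_scaled) - BIC(fit)                               # equals n * log(1000^2)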

As some of the comments point out, it is arbitrary to call your BIC "big" without something to compare it to. In particular, the BIC is roughly proportional to the sample size, which is large in this case. In your present analysis you have $n=95840$ data points and $k = 12 \times 24 = 288$ model terms, with $R^2 = 0.9220565$ and $\text{BIC} = 443093.3$. The terms inside the brackets of the above expression are therefore:

$$\begin{align} 1 + \ln (2 \pi) &= 2.837877, \\[6pt] \frac{k \ln(n)}{n} &= 0.03446875, \\[6pt] \ln \bigg( 1-\frac{1}{n} \bigg) &= -1.043411 \times 10^{-5}, \\[6pt] \ln (1-R^2) &= -2.551771, \\[6pt] \ln (s_Y^2) &\approx 4.302696. \end{align}$$

(I have used your reported values to reverse-engineer the sample variance of your response variable, so it is only approximate and subject to substantial rounding error. Since you have access to the raw data, you can easily compute this term with far greater precision.) This gives a sense of the contribution to the BIC from each individual part. You can see that the part involving the sample variance of the response variable contributes more to the BIC than the part involving the coefficient-of-determination. However, the primary reason you have a high BIC is simply that $n$ is large.
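For completeness, the contributions above can be reproduced directly in R from the reported figures, reverse-engineering $\ln(s_Y^2)$ in the same way:

## Values of n, k, R^2 and BIC as reported in the question and answer
n   <- 95840
k   <- 288
r2  <- 0.9220565
bic <- 443093.3

terms <- c(constant   = 1 + log(2 * pi),
           penalty    = k * log(n) / n,
           adjustment = log(1 - 1 / n),
           fit        = log(1 - r2))
terms
bic / n - sum(terms)   # reverse-engineered ln(s_Y^2), approx. 4.302696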
