Solved – In the Bayesian Information Criterion (BIC), why is a bigger n penalized?


The Bayesian Information Criterion (BIC) is calculated with:

$$
\text{BIC} = \frac{1}{n \hat{\sigma}^2} \Big(\text{RSS} + \ln(n)\, d\, \hat{\sigma}^2 \Big)
$$

where RSS is the residual sum of squares and $\hat{\sigma}^2$ is an estimate of the variance of the error associated with each response measurement.
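For concreteness, here is a minimal Python sketch of the formula above; the function name and argument names are my own, not from any library.

```python
import numpy as np

def bic(rss, n, d, sigma2_hat):
    """BIC as defined above: (RSS + ln(n) * d * sigma2_hat) / (n * sigma2_hat).

    rss        -- residual sum of squares of the fitted model
    n          -- number of observations
    d          -- number of predictors in the model
    sigma2_hat -- estimate of the error variance
    """
    return (rss + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
```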

I have two questions about BIC.

Q1. Why is a larger sample size penalized, when having more data is usually better than having less?

I have learned that having more sample data is always better. For example, with more data samples you get smaller standard errors and narrower confidence intervals.

But according to this BIC formula, a statistical model with more sample data would seem to get penalized, i.e. to have less chance of being selected. This becomes more obvious when BIC is compared with AIC. Since AIC uses 2 where BIC uses ln(n), and ln(n) > 2 whenever n > 7, a model whose sample size n is bigger than 7 would seem to have less chance of being selected when we use BIC to choose the optimal model. Why would the creator of BIC want to penalize a model with a bigger sample size n?
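As a quick, purely illustrative check of where the threshold 7 comes from (this compares only the per-parameter penalty constants, nothing else):

```python
import numpy as np

# BIC charges ln(n) per parameter where AIC charges 2;
# ln(7) is about 1.95 < 2, while ln(8) is about 2.08 > 2.
for n in (5, 7, 8, 50, 1000):
    print(n, round(float(np.log(n)), 3), np.log(n) > 2)
```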

Q2. Why does my textbook, 'An Introduction to Statistical Learning', change the meaning of n to 'variables', when we already have d, which stands for the number of predictors in the statistical model?

My book says the following about BIC.

Notice that BIC replaces the $2 d \hat{\sigma}^2$ used by Cp with a $\ln(n)\, d\, \hat{\sigma}^2$ term, where n is the number of observations. Since ln(n) > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp. (p. 212)

I cannot guess why the author of this book changed the meaning of n from 'the number of observations (sample data points)' to 'the number of variables'. Don't we already have d, which gives the number of predictor variables plus the intercept?

I would deeply appreciate it if anyone here could answer my two questions. Thank you very much for reading!

Best Answer

I think the following will answer both of your questions.

First of all, with such criteria you select the model that has the minimum value, so n has the opposite effect from the one you describe: an increase in n alone will decrease the value.
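One way to see this: after dividing through by $n \hat{\sigma}^2$, the penalty contribution is $\ln(n)\, d / n$, which shrinks as n grows. A minimal numeric check (with d fixed arbitrarily at 3):

```python
import numpy as np

d = 3  # arbitrary fixed number of predictors
for n in (10, 100, 1000, 10000):
    # the penalty contribution ln(n) * d / n decreases as n grows
    print(n, np.log(n) * d / n)
```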

Secondly, information criteria are used to select between different models, not between different samples. These criteria exist because adding more parameters always improves the fit; a better fit, however, does not necessarily mean a better model, due to parsimony and degrees-of-freedom concerns in academia and overfitting concerns in practice.

A criterion such as BIC is used to compare models that contain different variables, where n is the same across the candidates. Therefore n is not there to penalize or favor the sample size; I am guessing it is there to normalize RSS, since RSS will increase indefinitely with n. In contrast, adding more parameters is penalized, as it increases the value of the criterion.
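To make this concrete, here is a small simulation of my own (an illustration, not part of the original answer): five nested candidate models share the same n, only the first predictor truly matters, and BIC/AIC are computed with the formula from the question, with $\hat{\sigma}^2$ estimated from the largest model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                    # 5 candidate predictors
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters

def rss_of(X_sub, y):
    """Least-squares fit with an intercept; returns the residual sum of squares."""
    X1 = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return resid @ resid

# Estimate the error variance from the largest candidate model.
sigma2 = rss_of(X, y) / (n - 5 - 1)

# n is identical for every candidate; only d and RSS differ.
for d in range(1, 6):
    rss = rss_of(X[:, :d], y)
    bic = (rss + np.log(n) * d * sigma2) / (n * sigma2)
    aic = (rss + 2 * d * sigma2) / (n * sigma2)
    print(f"d={d}: BIC={bic:.3f}  AIC={aic:.3f}")
# BIC is typically minimized at d=1 here: the extra noise predictors
# lower RSS slightly, but not enough to offset the ln(n) penalty.
```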
