Solved – In the Bayesian Information Criterion (BIC), why is a bigger n penalized?


The Bayesian Information Criterion (BIC) is calculated with:

$$
\text{BIC} = \frac{1}{n \hat{\sigma}^2} \Big(\text{RSS} + \ln(n)\, d\, \hat{\sigma}^2 \Big)
$$

where RSS is the residual sum of squares and $\hat{\sigma}^2$ is an estimate of the variance of the error associated with each response measurement.
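For concreteness, here is a minimal Python sketch of the formula above; the function name and argument names are my own, not from any library.

```python
import numpy as np

def bic(rss, n, d, sigma2_hat):
    """BIC as defined above: (RSS + ln(n) * d * sigma2_hat) / (n * sigma2_hat).

    rss        -- residual sum of squares of the fitted model
    n          -- number of observations
    d          -- number of predictors in the model
    sigma2_hat -- estimate of the error variance
    """
    return (rss + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
```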

I have two questions about BIC.

Q1. Why is a larger sample size penalized, when having more data is usually better than having less?

I have learned that having more sample data is always better. For example, with more data samples you get smaller standard errors and narrower confidence intervals.

But according to this BIC formula, a statistical model with more sample data would seem to get penalized, i.e. to have less chance of being selected. This becomes more obvious when BIC is compared with AIC. Since AIC uses 2 where BIC uses ln(n), and ln(n) > 2 whenever n > 7, a model whose sample size n is bigger than 7 would seem to have less chance of being selected when we use BIC to choose the optimal model. Why would the creator of BIC want to penalize a model with a bigger sample size n?
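As a quick, purely illustrative check of where the threshold 7 comes from (this compares only the per-parameter penalty constants, nothing else):

```python
import numpy as np

# BIC charges ln(n) per parameter where AIC charges 2;
# ln(7) is about 1.95 < 2, while ln(8) is about 2.08 > 2.
for n in (5, 7, 8, 50, 1000):
    print(n, round(float(np.log(n)), 3), np.log(n) > 2)
```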

Q2. Why does my textbook, 'An Introduction to Statistical Learning', change the meaning of n to 'variables', when we already have d, which stands for the number of predictors in the statistical model?

My book says the following about BIC.

Notice that BIC replaces the $2 d \hat{\sigma}^2$ used by Cp with a $\ln(n)\, d\, \hat{\sigma}^2$ term, where n is the number of observations. Since ln(n) > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp. (p. 212)

I cannot guess why the author of this book changed the meaning of n from 'the number of observations (sample data points)' to 'the number of variables'. Don't we already have d, which gives the number of predictor variables plus the intercept?

I would deeply appreciate it if anyone here could answer my two questions. Thank you very much for reading!

Best Answer

I think the following will answer both of your questions.

First of all, with such criteria you select the model that has the minimum value, so n has the opposite effect from the one you describe: an increase in n alone will decrease the value.
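One way to see this: after dividing through by $n \hat{\sigma}^2$, the penalty contribution is $\ln(n)\, d / n$, which shrinks as n grows. A minimal numeric check (with d fixed arbitrarily at 3):

```python
import numpy as np

d = 3  # arbitrary fixed number of predictors
for n in (10, 100, 1000, 10000):
    # the penalty contribution ln(n) * d / n decreases as n grows
    print(n, np.log(n) * d / n)
```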

Secondly, information criteria are used to select between different models, not between different samples. These criteria exist because adding more parameters always improves the fit; a better fit, however, does not necessarily mean a better model, due to parsimony and degrees-of-freedom concerns in academia and overfitting concerns in practice.

A criterion such as BIC is used to compare models that contain different variables, where n is the same across the candidates. Therefore n is not there to penalize or favor the sample size; I am guessing it is there to normalize RSS, since RSS will increase indefinitely with n. In contrast, adding more parameters is penalized, as it increases the value of the criterion.
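To make this concrete, here is a small simulation of my own (an illustration, not part of the original answer): five nested candidate models share the same n, only the first predictor truly matters, and BIC/AIC are computed with the formula from the question, with $\hat{\sigma}^2$ estimated from the largest model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                    # 5 candidate predictors
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters

def rss_of(X_sub, y):
    """Least-squares fit with an intercept; returns the residual sum of squares."""
    X1 = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return resid @ resid

# Estimate the error variance from the largest candidate model.
sigma2 = rss_of(X, y) / (n - 5 - 1)

# n is identical for every candidate; only d and RSS differ.
for d in range(1, 6):
    rss = rss_of(X[:, :d], y)
    bic = (rss + np.log(n) * d * sigma2) / (n * sigma2)
    aic = (rss + 2 * d * sigma2) / (n * sigma2)
    print(f"d={d}: BIC={bic:.3f}  AIC={aic:.3f}")
# BIC is typically minimized at d=1 here: the extra noise predictors
# lower RSS slightly, but not enough to offset the ln(n) penalty.
```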
