Solved – the correct sample size in the BIC formula


The definition of the Bayesian information criterion is usually given as $$\operatorname{BIC} = -2\ln(L) + k\ln(n)\,,$$ where $\ln(L)$ is the maximized log-likelihood of the data given a particular model, $k$ is the number of free parameters that this model has, and finally $n$ is the number of data points that the model is being fit to.
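For concreteness, the definition is a one-liner in code (a minimal Python sketch; the helper name `bic` is mine, not from any library):

```python
import math

def bic(log_likelihood, k, n):
    """Bayesian information criterion: -2*ln(L) + k*ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)
```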

If I had a model which predicts the number of correct responses in two experimental conditions with 100 trials each, and my data consisted of the number of correct responses in each of the two conditions (e.g., 34 out of 100), then what exactly should I use for $n$, 2 or 200?

Best Answer

Nameless is wrong.

Math example

Raw data: (1,1,0,0,1,1), where the first three entries are Item 1 (three repeated trials) and the last three are Item 2 (likewise three trials).

Corresponding predictions of Model 1: (.5,.5,.5,.6,.6,.6)
Corresponding predictions of Model 2: (.9,.9,.9,.5,.5,.5)
Negative log-likelihood of Model 1: 4.02
Negative log-likelihood of Model 2: 4.59

The BIC penalty for each additional parameter would be ln(6)=1.79

BIC Model 1 (2 parameters): 11.62
BIC Model 2 (4 parameters): 16.35
Difference: 4.73 (rounded)
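These raw-data numbers can be reproduced in a few lines (a Python sketch of the Bernoulli likelihood above; the variable names are mine):

```python
import math

data  = [1, 1, 0, 0, 1, 1]         # raw Bernoulli outcomes
pred1 = [.5, .5, .5, .6, .6, .6]   # Model 1 predictions
pred2 = [.9, .9, .9, .5, .5, .5]   # Model 2 predictions

def neg_loglik(ys, ps):
    # Bernoulli negative log-likelihood: -ln(p) for a 1, -ln(1-p) for a 0
    return -sum(math.log(p if y == 1 else 1 - p) for y, p in zip(ys, ps))

n = len(data)                                          # 6
bic1 = 2 * neg_loglik(data, pred1) + 2 * math.log(n)  # ~11.62
bic2 = 2 * neg_loglik(data, pred2) + 4 * math.log(n)  # ~16.35
print(round(bic2 - bic1, 2))                           # 4.73
```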

Aggregated data (the same data as above): data = (2,2), where the first entry is Item 1 (2 successes counted out of 3 trials) and the second is Item 2 (likewise 2 out of 3).

Corresponding predictions of Model 1: (.5,.6)
Corresponding predictions of Model 2: (.9,.5)
Negative log-likelihood of Model 1: 1.82
Negative log-likelihood of Model 2: 2.40

When using n=2 (instead of 6 as with the raw data), the BIC penalty for each additional parameter would be ln(2) = .69

BIC Model 1 (2 parameters): 5.03
BIC Model 2 (4 parameters): 7.56
Difference: 2.54 (rounded; 2.53 if computed from the already-rounded BIC values above)

Note: this is a different result than with the raw data above, and it underestimates the penalty!

When using n=6 in this case:
BIC Model 1 (2 parameters): 7.22
BIC Model 2 (4 parameters): 11.96
Difference: 4.73!

Note: This is the same BIC difference as with the raw data, even though the data were aggregated and the log-likelihoods differ.
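This, too, can be checked in a few lines (a Python sketch; scipy.stats.binom.logpmf gives the aggregated binomial log-likelihood, binomial coefficients included):

```python
import math
from scipy.stats import binom

successes, trials = [2, 2], [3, 3]  # aggregated data: 2 of 3 per item
pred1 = [.5, .6]                    # Model 1 predictions
pred2 = [.9, .5]                    # Model 2 predictions

def neg_loglik(ks, ns, ps):
    # binomial negative log-likelihood, binomial coefficients included
    return -sum(binom.logpmf(k, n, p) for k, n, p in zip(ks, ns, ps))

nll1 = neg_loglik(successes, trials, pred1)  # ~1.82
nll2 = neg_loglik(successes, trials, pred2)  # ~2.40
for n in (2, 6):  # number of aggregated rows vs. total number of trials
    diff = 2 * (nll2 - nll1) + (4 - 2) * math.log(n)  # BIC(Model 2) - BIC(Model 1)
    print(n, round(diff, 2))  # n=2 -> 2.54, n=6 -> 4.73
```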

The reason: summing the 6 per-trial negative log-likelihoods instead of the 2 aggregated ones gives a higher total for each model (the aggregated likelihood additionally absorbs the binomial coefficients), but the difference between the negative log-likelihoods of the two models is exactly the same, no matter whether you use raw or aggregated data, as long as it is the same data and the same model predictions.
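In symbols: for either model, the aggregated binomial log-likelihood is

$$\ln L_{\text{agg}} = \sum_{i} \left[ \ln\binom{n_i}{k_i} + k_i \ln p_i + (n_i - k_i)\ln(1 - p_i) \right]\,,$$

and the term $\ln\binom{n_i}{k_i}$ depends only on the data, not on the model's predictions $p_i$, so it cancels from the difference between the two models' log-likelihoods; the remaining terms are exactly the raw-data log-likelihood. Here that shared term is $2\ln\binom{3}{2} = 2\ln 3 \approx 2.20$, which is the gap between the raw and aggregated values above (4.02 vs. 1.82, and 4.59 vs. 2.40).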

Use the TOTAL SAMPLE SIZE (or you will underestimate the penalty). I wonder how many people have gotten this wrong so far... :)
