Nameless is wrong.
Math example
Raw data: (1,1,0,0,1,1), where the first three entries belong to Item 1 (observed three times) and the last three to Item 2 (likewise observed three times).
corresponding predictions Model 1: (.5,.5,.5,.6,.6,.6)
corresponding predictions Model 2: (.9,.9,.9,.5,.5,.5)
-loglikelihood of Model 1: 4.01
-loglikelihood of Model 2: 4.59
The BIC penalty for each additional parameter would be ln(6)=1.79
BIC Model 1 (2 parameters): 11.62
BIC Model 2 (4 parameters): 16.35
Difference: 4.73 (rounded)
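If you want to verify this, here is a minimal Python sketch of the raw-data computation (Bernoulli likelihood, as above; the variable names are mine):

    import numpy as np

    y = np.array([1, 1, 0, 0, 1, 1])             # raw data
    p1 = np.array([.5, .5, .5, .6, .6, .6])      # predictions, Model 1
    p2 = np.array([.9, .9, .9, .5, .5, .5])      # predictions, Model 2

    def nll(y, p):
        # Bernoulli negative log-likelihood
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    n = len(y)                                   # total sample size: 6
    bic1 = 2 * nll(y, p1) + 2 * np.log(n)        # 2 parameters -> 11.62
    bic2 = 2 * nll(y, p2) + 4 * np.log(n)        # 4 parameters -> 16.35
    print(bic1, bic2, bic2 - bic1)               # difference   -> 4.73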
Aggregated data (otherwise the same data): data = (2,2), where the first entry is Item 1 (counted 2 times out of 3) and the second is Item 2 (likewise counted 2 out of 3 times). Same as before:
corresponding predictions Model 1: (.5,.6)
corresponding predictions Model 2: (.9,.5)
-loglikelihood of Model 1: 1.82
-loglikelihood of Model 2: 2.40
When using n=2 (instead of n=6 as with the raw data), the BIC penalty for each additional parameter would be ln(2)=.69
BIC Model 1 (2 parameters): 5.03
BIC Model 2 (4 parameters): 7.56
Difference: 2.53
Note: this is a different result than with the raw data above, and it underestimates the penalty!
When using n=6 in this case,
BIC Model 1 (2 parameters): 7.22
BIC Model 2 (4 parameters): 11.96
Difference: 4.73!
Note: this is the same BIC difference as with the raw data, although the data were aggregated and the -loglikelihoods differ.
The reason: summing 6 pointwise -loglikelihoods instead of 2 aggregated ones leads to higher -loglikelihoods in total for each model, but the difference between the -loglikelihoods of the two models is exactly the same, no matter whether you use raw or aggregated data, as long as it is the same data and the same model predictions. (The binomial coefficients in the aggregated likelihood are model-independent constants, so they cancel in the difference.)
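The same sketch for the aggregated case (binomial likelihood with the binomial coefficients included, hence the smaller absolute -loglikelihoods), run with both choices of n; I use scipy's binom here for the binomial pmf:

    import numpy as np
    from scipy.stats import binom

    k = np.array([2, 2])          # successes per item, out of 3 trials each
    p1 = np.array([.5, .6])       # predictions, Model 1
    p2 = np.array([.9, .5])       # predictions, Model 2

    def nll(k, p):
        # binomial negative log-likelihood (binomial coefficients included)
        return -np.sum(binom.logpmf(k, 3, p))

    for n in (2, 6):              # number of rows vs. total sample size
        bic1 = 2 * nll(k, p1) + 2 * np.log(n)
        bic2 = 2 * nll(k, p2) + 4 * np.log(n)
        print(n, round(bic1, 2), round(bic2, 2), bic2 - bic1)
    # n=2: 5.03 and 7.56 -> the penalty is underestimated
    # n=6: 7.22 and 11.96 -> difference 4.73, matching the raw-data result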
Use the TOTAL SAMPLE SIZE (or you will underestimate the penalty).
I wonder how many people have been doing this wrong so far... :)
Actually, you may have a look at chapter 8.4.2 in Murphy's book 'Machine Learning: A Probabilistic Perspective', where the BIC is nicely derived from the marginal likelihood. A flat prior is assumed there; however, under some specific assumptions the presented method (Laplace approximation) may be applied to a nonuniform prior as well, which results in the likelihood in the BIC being replaced by the posterior evaluated at the MAP estimate (similar to what you have written in Question 1, but without the integration). I have a scanned, self-written PDF about this that I may share (unfortunately it's half English/half German, as I had intended it for private use only).
Now, the BIC's rooting in the marginal likelihood also makes it clear that you were on the right track regarding Question 1: if you can analytically compute the integral over $\theta_1$, then you do not have to resort to Laplace's approximation for $\theta_1$ and will thus end up with a better approximation of the marginal likelihood (which is what you are really interested in; the BIC is just a substitute). The Laplace approximation may then be applied only to the resulting expression, i.e. $p(x|\theta_2,M)$. In other words, your approach is correct: you may swap the $\hat{L}$ of the original BIC for $\hat{L} = p(x|\hat{\theta}_2,M)$, with the parameter count reduced to the dimension of $\theta_2$.
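In formulas (my notation; just a sketch of the reasoning): if the integral over $\theta_1$ is analytically tractable, the marginal likelihood factors as
$$p(x|M) = \int\!\!\int p(x|\theta_1,\theta_2,M)\,p(\theta_1,\theta_2|M)\,d\theta_1\,d\theta_2 = \int p(x|\theta_2,M)\,p(\theta_2|M)\,d\theta_2,$$
where $p(x|\theta_2,M) = \int p(x|\theta_1,\theta_2,M)\,p(\theta_1|\theta_2,M)\,d\theta_1$, and only the remaining integral over $\theta_2$ needs the Laplace approximation.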
Now, w.r.t. Question 2: as Laplace's method is based on the maximum of the exponent of the integrand in question (which is the joint probability of $x,\theta_1,\theta_2$ in the case of the marginal likelihood), you must use the MAP estimate as your "highest likelihood point", i.e. the point $$(\theta^*_1,\theta^*_2) = \arg\max_{(\theta_1,\theta_2)}\, p(\theta_1,\theta_2|x,M).$$
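For reference, the Laplace approximation that underlies this (standard form, in my notation) is
$$p(x|M) \approx p(x|\theta^*,M)\,p(\theta^*|M)\,(2\pi)^{d/2}\,|H|^{-1/2},$$
where $d$ is the dimension of $(\theta_1,\theta_2)$, $\theta^*$ is the MAP estimate above, and $H$ is the Hessian of $-\log p(x,\theta_1,\theta_2|M)$ at $\theta^*$. Taking $-2\log$ of both sides and dropping the terms that stay bounded as $n$ grows (with a flat prior, so that the MAP coincides with the MLE) gives the familiar $-2\log\hat{L} + d\log n$.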
Hope this helps...
Best Answer
If I understood you correctly:
As the BIC is basically used to compare models (like AIC or MDL), you can apply any monotone transformation as long as you apply it to both of the compared models.
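For instance (a minimal sketch, reusing the BIC values from the example above; the model names are just labels): comparing $\exp(-BIC/2)$, which is proportional to the approximate marginal likelihood, picks the same model as comparing the BICs directly, because $x \mapsto \exp(-x/2)$ is strictly monotone (decreasing, so the ordering flips direction but the winner is the same):

    import numpy as np

    bic = {"Model 1": 11.62, "Model 2": 16.35}               # values from above
    evidence = {m: np.exp(-b / 2) for m, b in bic.items()}   # monotone transform

    # Both criteria select the same model:
    print(min(bic, key=bic.get))             # Model 1 (lower BIC is better)
    print(max(evidence, key=evidence.get))   # Model 1 (higher evidence is better)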