The mean square error (MSE) reported for this regression is about 77 months squared, setting aside the ridiculous default display format of your software.

You quite rightly work with its square root, about 8.8 months, which we can call the root MSE (RMSE); several other names exist. Regardless of terminology, I would assert that the MSE is usually no help at all in thinking about the model, while the RMSE can be helpful, the difference being that the RMSE is in the familiar units of measurement. But even after taking the square root, a few large errors or residuals can pull up either measure, and that is consistent with many of the errors being quite small.
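To make the units point concrete, here is a small numpy sketch (with made-up prediction errors in months, not your data) showing how a couple of large residuals inflate both MSE and RMSE even when the typical error is small:

```python
import numpy as np

# Hypothetical prediction errors in months: most are small,
# but two large misses dominate the squared-error measures.
errors = np.array([1.0, -2.0, 1.5, -1.0, 2.0, -1.5, 0.5, 22.0, -16.0, 1.0])

mse = np.mean(errors ** 2)           # in months SQUARED -- hard to interpret
rmse = np.sqrt(mse)                  # back in months -- interpretable
median_abs = np.median(np.abs(errors))

print(f"MSE  = {mse:.1f} months^2")            # ~ 75.6 months^2
print(f"RMSE = {rmse:.1f} months")             # ~ 8.7 months
print(f"median |error| = {median_abs} months") # 1.5 months
```

An RMSE near 9 months coexists with a median absolute error of 1.5 months: the squared-error measures are dominated by the two outliers.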

Without seeing your data I am not at all surprised by that result for either measure. In this kind of data, the uncertainty of predictions is high even given good information on the characteristics of the patients. Everyone has heard stories of people given 6 months to survive but lasting 2 years, and so forth. That's usually a matter of expert guesses being uncertain rather than a regression model leaving a considerable degree of scatter, but there is a link.

In broad terms, $R^2$ and RMSE are on quite different scales. If they appear to be conflicting, that can always be resolved by looking at the complete set of errors or residuals. Even with several predictors in a regression model, you can always inspect one or more of a histogram of residuals, a plot of residuals versus predicted values, or a plot of observed versus predicted values.
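As an illustration of those checks (with simulated data, since we do not have yours), a minimal numpy sketch that produces the quantities you would plot: `resid` for the histogram, `resid` against `fitted`, and `y` against `fitted`:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: two predictors plus noise with a handful of big misses.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([10.0, 3.0, -2.0])
noise = rng.normal(scale=2.0, size=n)
noise[:5] += 25.0                      # a few large outliers
y = X @ beta + noise

# Ordinary least squares fit via np.linalg.lstsq.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
resid = y - fitted                     # plot: hist(resid); resid vs fitted; y vs fitted

rmse = np.sqrt(np.mean(resid ** 2))
r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.2f}, "
      f"median |resid| = {np.median(np.abs(resid)):.2f}")
```

With an intercept in the model, OLS residuals have mean zero and are uncorrelated with the fitted values, so any visible pattern in the residual-versus-fitted plot points to a model problem, not an artefact of the fit.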

*An aside on two minor points of language*

Even when used informally, I would advise strongly against wording such as

- "the regression model correlate[s] with the data 90% of the time"

It is in no sense helpful, or even meaningful, to regard $R^2$ as the fraction of the time, or even of the data, for which the model shows a correlation. Correlation is a property of a dataset, not of individual observations. Moreover, correlation is a matter of degree, not a yes-or-no matter.

- "a significant amount to be off"

The word "significant" is overloaded in discussions of statistical applications. There are good words like "big", "large", "notable", and "major" to be used when your meaning is informal. Reserve "significant" for discussing the results of a formal significance test.

P.S. In 45 years or so of using statistics, I don't think I have ever seen $R^2$ reported to as many as 13 decimal places. Most experienced researchers wouldn't use more than 3.

I cannot offer you a definite answer for that particular package, but in general, the Bayesian Information Criterion (BIC) is a *large-sample approximation* to the Bayes factor. Because the likelihood dominates the prior in large samples, the effect of the prior is neglected.

Here is a somewhat heuristic derivation.

We can obtain a large-sample normal approximation to the log-likelihood $l(\theta|y)$ in the multiparameter case. A second-order multivariate Taylor expansion about the maximum likelihood estimate $\hat\theta$ gives
\begin{eqnarray}
l(\theta|y)&\approx&l(\hat\theta|y)+(\theta-\hat\theta)'\frac{\partial l(\hat\theta|y)}{\partial\theta}-\frac{n}{2}(\theta-\hat\theta)'V^{-1}(\theta-\hat\theta)\notag\\
&=& l(\hat\theta|y)-\frac{n}{2}(\theta-\hat\theta)'V^{-1}(\theta-\hat\theta)
\end{eqnarray}
where the first-order term vanishes because the score $\partial l(\hat\theta|y)/\partial\theta$ is zero at the MLE, and
$$
V=\left(-\frac{1}{n}\sum_j\frac{\partial^2 l(\hat\theta|y_j)}{\partial\theta\partial\theta'}\right)^{-1}=\left(-\frac{1}{n}\frac{\partial^2 l(\hat\theta|y)}{\partial\theta\partial\theta'}\right)^{-1}
$$
Let the marginal likelihood of model $M_i$ be $$m_i(y)=\int f_{i}\left(y|\theta_{i},M_i\right)\pi_{i}\left(\theta_{i}|M_i\right)d\theta_{i}.$$

Substituting the exponential of the approximate expression for $l(\theta|y)$ gives
\begin{eqnarray*}
m_i(y)&\approx&L(\hat\theta_i|y)\int\exp\left(-\frac{n}{2}(\theta_i-\hat\theta_i)'V_i^{-1}(\theta_i-\hat\theta_i)\right)\pi_i(\theta_i|M_i)d\theta_i\\
&\approx&L(\hat\theta_i|y)\pi_i(\hat\theta_i|M_i)\int\exp\left(-\frac{n}{2}(\theta_i-\hat\theta_i)'V_i^{-1}(\theta_i-\hat\theta_i)\right)d\theta_i,
\end{eqnarray*}
where the second approximation holds because the exponential factor concentrates in a shrinking neighborhood of $\hat\theta_i$, over which the prior $\pi_i$ is approximately constant (see the Laplace approximation for a rigorous argument).

Since the integrand is now the kernel of a multivariate normal distribution with covariance matrix $V_i/n$, carrying out the integration gives
\begin{eqnarray*}
m_i(y)&\approx&L(\hat\theta_i|y)\pi_i(\hat\theta_i|M_i)(2\pi)^{d_i/2}|n^{-1}V_i|^{1/2}\\
&=&L(\hat\theta_i|y)\pi_i(\hat\theta_i|M_i)(2\pi)^{d_i/2}n^{-d_i/2}|V_i|^{1/2},
\end{eqnarray*}
where $d_i$ is the number of parameters in $\theta_i$.
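To see this approximation at work numerically, here is a sketch for the simplest case, $y_i \sim N(\theta, 1)$ with a $N(0, 10^2)$ prior on $\theta$ (my choices, not from the question). With $d=1$ and $V=1$, the formula above reduces to $m(y)\approx L(\hat\theta|y)\,\pi(\hat\theta)\,(2\pi)^{1/2}\,n^{-1/2}$, which we compare against the exact marginal likelihood computed by quadrature:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200
y = rng.normal(loc=1.5, scale=1.0, size=n)

theta_hat = y.mean()                    # MLE of a normal mean
loglik = lambda t: np.sum(norm.logpdf(y, loc=t, scale=1.0))
prior = lambda t: norm.pdf(t, loc=0.0, scale=10.0)

# Exact log marginal likelihood: integrate relative to the peak for
# numerical stability (the likelihood is tiny in absolute terms).
ll_hat = loglik(theta_hat)
integrand = lambda t: np.exp(loglik(t) - ll_hat) * prior(t)
integral, _ = quad(integrand, theta_hat - 2.0, theta_hat + 2.0)
log_m_exact = ll_hat + np.log(integral)

# Laplace-type approximation from the derivation above: d = 1, V = 1.
log_m_laplace = (ll_hat + np.log(prior(theta_hat))
                 + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(n))

print(log_m_exact, log_m_laplace)       # agree to a few decimal places
```

The narrow integration range is harmless here because the likelihood concentrates within $O(n^{-1/2})$ of $\hat\theta$; the two log marginal likelihoods agree closely, and the error shrinks as $n$ grows.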

The log-Bayes factor for models 1 and 2 can now be approximated by
\begin{eqnarray*}
\log(B_{12})&\approx&\left[\log\left(\frac{L_1(\hat\theta_1|y)}{L_2(\hat\theta_2|y)}\right)-\frac{d_1-d_2}{2}\log(n)\right]\\
&&+\left[\log\left(\frac{\pi_1(\hat\theta_1|M_1)}{\pi_2(\hat\theta_2|M_2)}\right)+\frac{1}{2}\log\frac{|V_1|}{|V_2|}+\frac{d_1-d_2}{2}\log(2\pi)\right]
\end{eqnarray*}
The second square bracket remains bounded as $n$ grows and can therefore be neglected for large $n$. Within the first bracket, the first term is the log-likelihood ratio, which is large if the data favors $M_1$; the second term penalizes the model with more parameters. Under the convention $\text{BIC}_i=-2\,l_i(\hat\theta_i|y)+d_i\log(n)$, multiplying the first bracket by $-2$ gives exactly $\text{BIC}_1-\text{BIC}_2$, which is why differences in BIC approximate $-2\log(B_{12})$.
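That link between the leading bracket and the BIC difference is an algebraic identity, which a short sketch with two nested Gaussian regressions (simulated data, my own setup) makes explicit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)       # x2 is irrelevant

def fit_gaussian_ols(X, y):
    """Return (maximized log-likelihood, number of free parameters)
    for a Gaussian linear model with sigma^2 profiled out."""
    n = len(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)
    sigma2 = sse / n                           # MLE of the error variance
    ll = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return ll, X.shape[1] + 1                  # + 1 for sigma^2

ones = np.ones(n)
ll1, d1 = fit_gaussian_ols(np.column_stack([ones, x1]), y)       # M1
ll2, d2 = fit_gaussian_ols(np.column_stack([ones, x1, x2]), y)   # M2

log_B12 = (ll1 - ll2) - 0.5 * (d1 - d2) * np.log(n)   # leading bracket
bic1 = -2 * ll1 + d1 * np.log(n)
bic2 = -2 * ll2 + d2 * np.log(n)

print(log_B12, -0.5 * (bic1 - bic2))          # identical by construction
```

The approximation enters only in treating this quantity as $\log(B_{12})$, i.e. in dropping the bounded second bracket.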

## Best Answer

Suppose you have a Gaussian regression model with $k$ model terms and you observe $n$ data points to fit the model. Let $R^2$ denote the coefficient-of-determination for the regression and let $s_Y^2$ denote the sample variance of the response variable. In a related answer I show that the Bayesian Information Criterion (BIC) can be written in terms of the coefficient-of-determination and these other quantities as:

$$\text{BIC} = n \Bigg[ 1+\ln (2 \pi) + \frac{k\ln(n)}{n} + \ln \bigg( 1-\frac{1}{n} \bigg) + \ln (1-R^2) + \ln (s_Y^2) \Bigg].$$

In particular, when $n$ is large you have:

$$\text{BIC} \approx n \Bigg[ 1 + \ln (2 \pi) + \ln (1-R^2) + \ln (s_Y^2) \Bigg].$$
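The exact identity can be verified numerically. A sketch on simulated data, assuming the convention $\text{BIC} = -2\,\hat\ell + k\ln(n)$ with the error variance profiled out and $k$ counting the model terms including the intercept:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 300, 4                               # k model terms, including intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([5.0, 1.0, -2.0, 0.5]) + rng.normal(scale=3.0, size=n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ coef) ** 2)
sigma2_mle = sse / n                        # MLE of the error variance

# BIC directly from the maximized Gaussian log-likelihood.
ll = -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1)
bic_direct = -2 * ll + k * np.log(n)

# BIC from the R^2 / sample-variance expression above.
r2 = 1 - sse / np.sum((y - y.mean()) ** 2)
s2_y = np.var(y, ddof=1)                    # sample variance of the response
bic_formula = n * (1 + np.log(2 * np.pi) + k * np.log(n) / n
                   + np.log(1 - 1 / n) + np.log(1 - r2) + np.log(s2_y))

print(bic_direct, bic_formula)              # equal up to floating-point error
```

The two agree exactly because $\hat\sigma^2 = \text{SSE}/n = (1-R^2)\,s_Y^2\,(1-1/n)$, which is precisely the substitution behind the displayed expression.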

(In the present case you have a regression model that includes an intercept term, so I have simplified the total degrees-of-freedom in the linked expression.) We can see from this equation that a higher coefficient-of-determination leads to a lower BIC, as we would expect. However, there are also a number of other terms in the expression that could give you a "big" BIC. In particular, the BIC is a **scale-dependent measure** that is roughly proportionate to the scale $n$. As some of the comments point out, it is arbitrary to say that your BIC is "big" without something to compare it to. In particular, the BIC is **roughly proportionate to the sample size**, which is high in this case. In your present analysis you have $n=95840$ data points and $k= 12 \times 24 = 288$ model terms, and you have $R^2 = 0.9220565$ and $\text{BIC} = 443093.3$. In the present case, the terms inside the brackets of the above expression are:

$$\begin{align}
1 + \ln (2 \pi) &= 2.837877, \\[12pt]
\frac{k \ln(n)}{n} &= 0.03446875, \\[6pt]
\ln \bigg( 1-\frac{1}{n} \bigg) &= -1.043411 \times 10^{-5}, \\[6pt]
\ln (1-R^2) &= -2.551771, \\[14pt]
\ln (s_Y^2) &\approx 4.302696. \\[6pt]
\end{align}$$

(I have used your values to reverse-engineer the sample variance of your response variable, so it is only approximate, and subject to substantial rounding error. Since you have access to the raw data you can easily compute this term with far greater precision.) This gives a sense of the contribution to the BIC from each individual part. You can see that the part involving the sample variance of the response variable is contributing more to the BIC than the part involving coefficient-of-determination. However, the primary reason you have a high BIC is that $n$ is high.
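The arithmetic is easy to reproduce from the reported quantities; a quick sketch plugging in $n$, $k$, $R^2$, and the reported BIC to recover the bracket terms and reverse-engineer $\ln(s_Y^2)$:

```python
import numpy as np

# Quantities reported in the question.
n, k = 95840, 288
r2 = 0.9220565
bic = 443093.3

term_const = 1 + np.log(2 * np.pi)       # ~  2.837877
term_k = k * np.log(n) / n               # ~  0.03446875
term_n = np.log(1 - 1 / n)               # ~ -1.043411e-05
term_r2 = np.log(1 - r2)                 # ~ -2.551771

# Reverse-engineer ln(s_Y^2) from BIC = n * (sum of bracket terms).
log_s2y = bic / n - (term_const + term_k + term_n + term_r2)
print(log_s2y)                           # ~ 4.302696
```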