Solved – How to choose t-distribution degrees of freedom in “robust” Bayesian linear models

bayesian, generalized-linear-model, hierarchical-bayesian, regression, robust

It is well known that in both frequentist and Bayesian linear models, outliers can greatly influence the parameter estimates. Consider the simple example where one outcome variable, $y$, is predicted by one independent variable, $x$. Under the conventional Bayesian linear model, each observation $y_i$ is normally distributed with mean $\beta_{0} + \beta_{1}x_i$ and variance $\sigma^2$. An outlier can skew the slope coefficient, $\beta_{1}$, for instance making it look like there is a strong relationship between $y$ and $x$ when there is none.
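To illustrate (a minimal numpy sketch with simulated data; the sample size, true coefficients, and the outlier's position are all arbitrary), a single extreme point can move the least-squares slope, which is also the posterior mean under a flat prior and normal errors, well away from the true value of zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data with no true relationship between x and y (true slope = 0).
n = 30
x = rng.uniform(0, 10, size=n)
y = rng.normal(loc=2.0, scale=1.0, size=n)

# Append a single extreme outlier.
x_out = np.append(x, 10.0)
y_out = np.append(y, 25.0)

# Least-squares slope; under a flat prior and normal errors this is also
# the posterior mean of beta_1.
slope_clean = np.polyfit(x, y, 1)[0]
slope_outlier = np.polyfit(x_out, y_out, 1)[0]
print(f"slope without the outlier: {slope_clean:+.3f}")
print(f"slope with one outlier:    {slope_outlier:+.3f}")
```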

One solution to this problem, proposed by Kruschke and others, is to replace the normal distribution for $y$ with a Student $t$ distribution, which has an additional degrees-of-freedom parameter, usually denoted $\nu$, that controls the heaviness of the tails (smaller $\nu$ means heavier tails) and thus accommodates outliers. In his textbook, Kruschke allows this parameter to be informed by the data, but it seems (based on his figures) that it is greatly influenced by the choice of prior. My question is: why not just set $\nu$ to $n-1$, as in a frequentist $t$ test?
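To make the tail-heaviness concrete, here is a small scipy check (the cutoff of four scale units is arbitrary) comparing how surprising a far-out point is under $t$ distributions with various $\nu$ versus under the normal:

```python
from scipy import stats

# Probability of an observation falling more than 4 scale units from the mean.
for nu in (3, 10, 30):
    p_t = 2 * stats.t.sf(4, df=nu)
    print(f"t with nu={nu:>2}: P(|z| > 4) = {p_t:.2e}")
print(f"normal:        P(|z| > 4) = {2 * stats.norm.sf(4):.2e}")
```

Under $t_3$ a four-scale-unit point has probability around $3\times 10^{-2}$, versus about $6\times 10^{-5}$ under the normal, so the $t$ likelihood is far less disturbed by such a point.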

Best Answer

The degrees of freedom in a linear regression model with Student-$t$ errors are not fixed in either the classical or the Bayesian approach; they are treated as a parameter to be estimated. You are mixing up inference with hypothesis testing.

The formulation is as follows. You have $n$ response variables $y_1,\dots,y_n$, $n$ covariate vectors ${\bf x}_1,\dots,{\bf x}_n$ and $p+1$ regression parameters $\beta_0,\beta_1,\dots,\beta_p$. The regression model is:

$$y_j=\beta_0 +\beta_1 x_{j1}+\dots + \beta_p x_{jp} + \epsilon_j,$$

where $\epsilon_j$ is distributed according to a Student t distribution with $\nu>0$ degrees of freedom and scale parameter $\sigma>0$. Then, the likelihood function is:

$$L(\beta_0,\dots,\beta_p,\sigma,\nu) \propto \prod_{j=1}^n \dfrac{1}{\sigma}f\left(\dfrac{y_j-{\bf x}^{\top}_j\beta}{\sigma}\Bigg\vert \nu\right),$$

where $f(\cdot\vert \nu)$ is the Student $t$ density with $\nu$ degrees of freedom, ${\bf x}_j=(1,x_{j1},\dots,x_{jp})^{\top}$, and $\beta=(\beta_0,\dots,\beta_p)^{\top}$. The maximum likelihood estimator is obtained by maximising the likelihood function jointly with respect to all the parameters, including $\nu$.
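As a minimal sketch of that maximisation (simulated data and naive starting values seeded from the least-squares fit; everything here is illustrative rather than a recommended implementation), one can optimise the negative log-likelihood over $(\beta, \log\sigma, \log\nu)$ with scipy:

```python
import numpy as np
from scipy import optimize, stats

def neg_log_lik(params, X, y):
    """Negative log-likelihood of the Student-t regression model.
    params = (beta_0, ..., beta_p, log_sigma, log_nu); working on the
    log scale keeps sigma and nu positive during unconstrained optimisation."""
    k = X.shape[1]
    beta = params[:k]
    sigma = np.exp(params[k])
    nu = np.exp(params[k + 1])
    resid = (y - X @ beta) / sigma
    # log density of y_j is log f(resid_j | nu) - log(sigma)
    return -np.sum(stats.t.logpdf(resid, df=nu) - np.log(sigma))

# Simulated example: intercept plus one covariate, t_4 errors.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5]) + 0.5 * rng.standard_t(df=4, size=n)

# Initialise beta at the least-squares fit, sigma at 1, nu near 2.7.
start = np.concatenate([np.linalg.lstsq(X, y, rcond=None)[0], [0.0, 1.0]])
fit = optimize.minimize(neg_log_lik, start, args=(X, y), method="Nelder-Mead")
beta_hat = fit.x[:X.shape[1]]
sigma_hat, nu_hat = np.exp(fit.x[X.shape[1]:])
print("beta:", beta_hat, "sigma:", sigma_hat, "nu:", nu_hat)
```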

In order to obtain Bayesian inference, you need a prior for the parameters, from which the posterior distribution is constructed in the usual way: $\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$. Typically, people use a normal distribution for $\beta$ and an inverse gamma for $\sigma$, while for $\nu$ several noninformative priors have been proposed (a sketch of a full model follows the list below). Here is a list of options:

  1. Jeffreys prior: http://biomet.oxfordjournals.org/content/95/2/325.short
  2. Juarez-Steel prior: http://onlinelibrary.wiley.com/doi/10.1002/jae.1113/abstract
  3. Discrete noninformative prior: https://projecteuclid.org/euclid.ba/1393251776
  4. Noninformative prior based on kurtosis: http://projecteuclid.org/euclid.ejs/1440680330
  5. Penalised complexity prior: http://arxiv.org/abs/1403.4630
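As a hypothetical PyMC sketch of such a model (the hyperparameters are illustrative only; the Gamma(2, 0.1) prior on $\nu$ is one common weakly informative choice in the spirit of the Juarez-Steel option above, not an implementation of any of the exact priors in the list):

```python
import numpy as np
import pymc as pm

# Simulated data (arbitrary values): one covariate, heavy-tailed errors.
rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + 0.5 * rng.standard_t(df=4, size=n)

with pm.Model() as robust_model:
    # Normal priors on the regression coefficients.
    beta0 = pm.Normal("beta0", mu=0.0, sigma=10.0)
    beta1 = pm.Normal("beta1", mu=0.0, sigma=10.0)
    # Inverse-gamma prior on the scale, as mentioned above.
    sigma = pm.InverseGamma("sigma", alpha=2.0, beta=1.0)
    # Weakly informative Gamma(2, 0.1) prior on the degrees of freedom.
    nu = pm.Gamma("nu", alpha=2.0, beta=0.1)
    pm.StudentT("y", nu=nu, mu=beta0 + beta1 * x, sigma=sigma, observed=y)
    trace = pm.sample(draws=1000, tune=1000, random_seed=2)
```

The posterior for $\nu$ then reflects both the data and whichever prior you pick, which is exactly the sensitivity the question is asking about.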

Which one is better, and in what sense? That is an open question.