Solved – What would a robust Bayesian model for estimating the scale of a roughly normal distribution be

Tags: bayesian, estimation, r, robust, standard-deviation

There exist a number of robust estimators of scale. A notable example is the median absolute deviation (MAD), which for normally distributed data relates to the standard deviation as $\sigma = \mathrm{MAD}\cdot 1.4826$. In a Bayesian framework there are a number of ways to robustly estimate the location of a roughly normal distribution (say, a normal contaminated by outliers); for example, one could assume the data follow a t distribution or a Laplace distribution. Now my question:
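As a quick standalone check of that relation (in plain Python with only the standard library, so it runs without R/JAGS; the constant 1.4826 is approximately $1/\Phi^{-1}(3/4)$):

```python
import random
import statistics

# For normal data, the MAD rescaled by 1.4826 is a consistent
# estimate of the standard deviation.
random.seed(42)
y = [random.gauss(0.0, 1.0) for _ in range(100_000)]

med = statistics.median(y)
mad = statistics.median(abs(v - med) for v in y)
sd_from_mad = 1.4826 * mad       # rescaled MAD
sd = statistics.stdev(y)         # ordinary sample SD
print(round(sd_from_mad, 2), round(sd, 2))  # both close to 1.0
```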

What would a Bayesian model for measuring the scale of a roughly normal distribution in a robust way look like, robust in the same sense as the MAD or similar robust estimators?

As is the case with the MAD, it would be neat if the Bayesian model could recover the SD of a normal distribution in the case where the data actually are normally distributed.

edit 1:

A typical example of a model that is robust against contamination/outliers, when assuming the data $y_i$ are roughly normal, is to use a t distribution:

$$y_i \sim \mathrm{t}(m, s,\nu)$$

Where $m$ is the mean, $s$ is the scale, and $\nu$ is the degrees of freedom. With suitable priors on $m$, $s$ and $\nu$, $m$ will be an estimate of the mean of $y_i$ that is robust against outliers. However, $s$ will not be a consistent estimate of the SD of $y_i$, since $s$ depends on $\nu$. For example, if $\nu$ were fixed to 4.0 and the model above were fitted to a huge number of samples from a $\mathrm{Norm}(\mu=0,\sigma=1)$ distribution, then $s$ would be around 0.82. What I'm looking for is a model that is robust, like the t model, but for the SD instead of (or in addition to) the mean.
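The 0.82 figure can be checked numerically without any MCMC. The sketch below (plain Python, standard library only, so it runs without R/JAGS) fits the scale of a zero-location t with $\nu$ fixed at 4 to standard-normal data by maximum likelihood:

```python
import math
import random

NU = 4
random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(20_000)]

def neg_log_lik(s):
    # negative log-likelihood of a zero-location t with scale s and NU d.o.f.,
    # dropping additive constants that do not depend on s
    return sum(math.log(s) + (NU + 1) / 2 * math.log1p((v / s) ** 2 / NU)
               for v in y)

# golden-section search for the minimizing scale on [0.5, 2]
lo, hi = 0.5, 2.0
phi = (math.sqrt(5) - 1) / 2
for _ in range(40):
    a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
    if neg_log_lik(a) < neg_log_lik(b):
        hi = b
    else:
        lo = a
s_hat = (lo + hi) / 2
print(round(s_hat, 2))  # about 0.82, not 1
```

The fitted scale lands near 0.82 rather than 1, which is exactly why $s$ is not directly an estimate of the SD.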

edit 2:

Here follows a code example in R and JAGS showing that the t model mentioned above is more robust with respect to the mean.

library(rjags)

# generating some contaminated data
y <- c(rnorm(100, mean = 10, sd = 10),
       rnorm(10,  mean = 100, sd = 100))

#### A "standard" normal model ####
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dnorm(mu, inv_sigma2)
  }

  mu ~ dnorm(0, 0.00001)
  inv_sigma2 ~ dgamma(0.0001, 0.0001)
  sigma <- 1 / sqrt(inv_sigma2)
}"

model <- jags.model(textConnection(model_string), list(y = y))
mcmc_samples <- coda.samples(model, "mu", n.iter=10000)
summary(mcmc_samples)

### The quantiles of the posterior of mu
##  2.5%   25%   50%   75% 97.5% 
##   9.8  14.3  16.8  19.2  24.1 

#### A (more) robust t-model ####
library(rjags)
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dt(mu, inv_s2, nu)
  }

  mu ~ dnorm(0, 0.00001)
  inv_s2 ~ dgamma(0.0001,0.0001)
  s <- 1 / sqrt(inv_s2)
  nu ~ dexp(1/30) 
}"

model <- jags.model(textConnection(model_string), list(y = y))
mcmc_samples <- coda.samples(model, "mu", n.iter=10000)
summary(mcmc_samples)

### The quantiles of the posterior of mu
## 2.5%   25%   50%   75% 97.5% 
##8.03  9.35  9.99 10.71 12.14 

Best Answer

Bayesian inference in a t noise model with an appropriate prior will give a robust estimate of location and scale. The precise conditions that the likelihood and prior need to satisfy are given in the paper Bayesian robustness modelling of location and scale parameters by Andrade and O'Hagan (2011). The estimates are robust in the sense that a single observation cannot make the estimates arbitrarily large, as demonstrated in Figure 2 of the paper.
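That robustness property is easy to see even outside the Bayesian setting, by comparing maximum-likelihood estimates. The sketch below (plain Python, standard library only; scale fixed at 1 and $\nu=4$ purely for simplicity) plants one gross outlier in otherwise standard-normal data: the sample mean is dragged far from the bulk of the data, while the ML location of the t model barely moves.

```python
import math
import random

def t_neg_log_lik(mu, y, nu):
    # negative log-likelihood of a unit-scale t location model,
    # additive constants dropped
    return sum((nu + 1) / 2 * math.log1p((v - mu) ** 2 / nu) for v in y)

random.seed(7)
y = [random.gauss(0.0, 1.0) for _ in range(100)] + [1000.0]  # one gross outlier

mean_y = sum(y) / len(y)  # pulled to roughly 10 by the single outlier

# simple grid search for the t location estimate on [-10, 20], step 0.01
mu_t = min((i / 100 for i in range(-1000, 2001)),
           key=lambda m: t_neg_log_lik(m, y, 4))

print(round(mean_y, 1), round(mu_t, 2))  # mu_t stays near 0
```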

When the data are normally distributed, the SD of the fitted t distribution (for fixed $\nu$) does not match the SD of the generating distribution. But this is easy to fix. Let $\sigma$ be the standard deviation of the generating distribution and let $s$ be the standard deviation of the fitted t distribution (i.e. the fitted scale parameter times $\sqrt{\nu/(\nu-2)}$). If the data are scaled by 2, then from the form of the likelihood we know that $s$ must scale by 2 as well. This implies that $s = \sigma f(\nu)$ for some fixed function $f$. This function can be computed numerically by simulation from a standard normal. Here is the code to do this:

library(stats4)  # for mle(); the stats package is attached by default
y <- rnorm(100000, mean = 0, sd = 1)
nu <- 4
# negative log-likelihood of a zero-location t with scale s and nu d.o.f.
nLL <- function(s) -sum(stats::dt(y / s, nu, log = TRUE) - log(s))
fit <- mle(nLL, start = list(s = 1), method = "Brent", lower = 0.5, upper = 2)
# the variance of a standard t is nu / (nu - 2), so this converts the
# fitted scale into the SD of the fitted t
print(coef(fit) * sqrt(nu / (nu - 2)))

For example, at $\nu=4$ I get $f(\nu)\approx 1.18$. The desired estimator is then $\hat{\sigma} = s/f(\nu)$.
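The correction can be checked end to end. The sketch below (plain Python, standard library only; the helper `fitted_t_sd` is my own name, the answer's R code does the same job) computes $f(\nu)$ once from standard-normal draws, then applies $\hat{\sigma} = s/f(\nu)$ to data with true SD 2:

```python
import math
import random

NU = 4
PHI = (math.sqrt(5) - 1) / 2

def fitted_t_sd(y, lo, hi):
    # ML scale of a zero-location t(NU), found by golden-section search,
    # converted to the SD of the fitted t via Var = NU / (NU - 2)
    def nll(s):  # additive constants dropped
        return sum(math.log(s) + (NU + 1) / 2 * math.log1p((v / s) ** 2 / NU)
                   for v in y)
    for _ in range(40):
        a, b = hi - PHI * (hi - lo), lo + PHI * (hi - lo)
        if nll(a) < nll(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2 * math.sqrt(NU / (NU - 2))

random.seed(3)
# f(nu): SD of the t fitted to standard-normal draws
f_nu = fitted_t_sd([random.gauss(0, 1) for _ in range(20_000)], 0.5, 2.0)
y = [random.gauss(0, 2) for _ in range(20_000)]  # true sigma = 2
sigma_hat = fitted_t_sd(y, 0.5, 4.0) / f_nu
print(round(sigma_hat, 2))  # recovers roughly 2
```

Since the same $f(\nu)$ enters both fits, the Monte Carlo error largely cancels and the corrected estimate lands close to the true SD.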