Regression Analysis – Does OLS Bias Sigma Estimate When Residuals Are Non-Normal Compared to MLE?

Tags: least squares, maximum likelihood, r, regression, self-study

I've come across a textbook problem from Gelman's Regression and Other Stories that asks to compare MLE parameter values on a small dataset where n = 16:

  year growth  vote inc_party_candidate other_candidate
1  1952   2.40 44.60           Stevenson      Eisenhower
2  1956   2.89 57.76          Eisenhower       Stevenson
3  1960   0.85 49.91               Nixon         Kennedy
4  1964   4.21 61.34             Johnson       Goldwater
5  1968   3.02 49.60            Humphrey           Nixon
6  1972   3.62 61.79               Nixon        McGovern
7  1976   1.08 48.95                Ford          Carter
8  1980  -0.39 44.70              Carter          Reagan
9  1984   3.86 59.17              Reagan         Mondale
10 1988   2.27 53.94           Bush, Sr.         Dukakis
11 1992   0.38 46.55           Bush, Sr.         Clinton
12 1996   1.04 54.74             Clinton            Dole
13 2000   2.36 50.27                Gore       Bush, Jr.
14 2004   1.72 51.24           Bush, Jr.           Kerry
15 2008   0.10 46.32              McCain           Obama
16 2012   0.95 52.00               Obama          Romney

The question asks:

' … write a function … that computes the logarithm of the likelihood (8.6) as a function of the data and the parameters a, b, sigma. Evaluate this function at several values of these parameters, and make a plot demonstrating that it is maximised at the values computed from the formulas in the text'

When doing this, with the code below, I find the MLE value for sigma is 5, while an OLS estimate value is 3.9.

I'm not entirely sure what is causing this. My current understanding is that if the residual normality assumption is violated, OLS estimates for sigma will be biased (paper here), leading to unreliable inference. Given this, is the MLE giving a better estimate of sigma in this instance?

I do note the MLE estimate is within the 95% range for the OLS estimate:

c(
sqrt(((16-1)*3.9^2)/qchisq(c(.025),df=14, lower.tail=FALSE)),
sqrt(((16-1)*3.9^2)/qchisq(c(.975),df=14, lower.tail=FALSE))
)

2.955510 6.366565

library(dplyr)
library(ggplot2)
library(rprojroot)
library(rstanarm)
root<-has_file(".ROS-Examples-root")$make_fix_file()
df =read.table(root('ElectionsEconomy/data' , 'hibbs.dat'), header = TRUE) 
fit = stan_glm(vote ~ growth , data = df)
fit

mle = function(a,b,x,y,sigma){
  
    normal <- function(y,m,sigma){ 
      out = (1/(sqrt(2*pi*sigma)))*exp((-1/2)*((y-m)/sigma)^2)
      return(out)
    }
    # tibble() replaces the deprecated data_frame(); numeric() gives typed empty columns
    le.df = tibble('a' = numeric(),
                   'b' = numeric(),
                   'sigma' = numeric(),
                   'mle.log' = numeric()
                   )
    i = 1
    for(val.a in a){
      for(val.b in b){
        for(sig in sigma){
        m = val.a + val.b*x
        likeli = prod(normal(y , m , sig))
        le.df[i , ] = list(val.a,val.b, sig, log(likeli))
        i = i+1
        }
      }
    }
    return(le.df)
}
s = mle(coef(fit)[[1]], 
        coef(fit)[[2]],
        df$growth,
    df$vote,
        1:10
) 
ggplot(s, aes(sigma, mle.log)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept =s[s$mle.log == max(s$mle.log),]$sigma , color = 'red' )

# Density of the residuals from the fitted model
ggplot(tibble(fit$residuals), aes(x = `fit$residuals`)) + 
  geom_density()

stan_glm
 family:       gaussian [identity]
 formula:      vote ~ growth
 observations: 16
 predictors:   2
------
            Median MAD_SD
(Intercept) 46.2    1.7  
growth       3.0    0.7  

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.9    0.7

Best Answer

The maximum likelihood estimate is: $$\hat\sigma_{MLE} = \sqrt{ \dfrac { \sum_{i = 1}^n (y_i - \hat y_i)^2 }{ n }} $$

where $\hat y_i = \hat a + \hat b x_i$ is the fitted value. By the OLS estimate, I assume you mean the square root of the unbiased variance estimate, with $p$ the number of estimated coefficients (here $p = 2$).

$$ \hat\sigma_{OLS}= \sqrt{ \dfrac { \sum_{i = 1}^n (y_i - \hat y_i)^2 }{ n - p }} $$

By Jensen's inequality, this is a biased estimator of the standard deviation, $\sigma$, of the $iid$ error terms, even though $\hat\sigma_{OLS}^2$ is unbiased for the variance, $\sigma^2$, of the $iid$ error terms.

Whether the residuals are skewed or not, these equations give different results.
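A minimal sketch in R of the comparison, assuming the 16 rows printed in the question are the complete data set. It uses `lm()` rather than `stan_glm()` so the closed-form values are exact, and `dnorm(..., log = TRUE)` for the log-density; the names `loglik`, `sigma_mle`, `sigma_ols` and `sigma_asker` are introduced here for illustration only. One thing worth checking against the question's code: the normal density is scaled by $1/(\sigma\sqrt{2\pi})$, while the `normal()` helper there divides by `sqrt(2*pi*sigma)`. Maximising that expression over sigma gives $\sqrt{2\,SSR/n}$ rather than $\sqrt{SSR/n}$, which would push the grid maximum above both estimates.

```r
# Data copied from the table in the question (assumed complete, n = 16).
growth <- c(2.40, 2.89, 0.85, 4.21, 3.02, 3.62, 1.08, -0.39,
            3.86, 2.27, 0.38, 1.04, 2.36, 1.72, 0.10, 0.95)
vote   <- c(44.60, 57.76, 49.91, 61.34, 49.60, 61.79, 48.95, 44.70,
            59.17, 53.94, 46.55, 54.74, 50.27, 51.24, 46.32, 52.00)

fit <- lm(vote ~ growth)
n   <- length(vote)
p   <- length(coef(fit))            # 2: intercept and slope
ssr <- sum(residuals(fit)^2)        # sum of squared residuals

sigma_mle   <- sqrt(ssr / n)        # maximum likelihood: divide by n
sigma_ols   <- sqrt(ssr / (n - p))  # divide by n - p; same as summary(fit)$sigma
sigma_asker <- sqrt(2 * ssr / n)    # maximiser when the constant is sqrt(2*pi*sigma)

# Gaussian log-likelihood at the fitted a and b, as a function of sigma only.
loglik <- function(sigma) {
  sum(dnorm(vote, mean = fitted(fit), sd = sigma, log = TRUE))
}

# Evaluate on a fine grid and locate the maximiser.
sigma_grid <- seq(1, 10, by = 0.01)
ll         <- sapply(sigma_grid, loglik)
sigma_best <- sigma_grid[which.max(ll)]

print(c(mle = sigma_mle, ols = sigma_ols,
        grid = sigma_best, asker = sigma_asker))
```

Calling `plot(sigma_grid, ll, type = "l")` followed by `abline(v = sigma_mle, col = "red")` would draw the curve the exercise asks for. The grid in the question (`1:10`) is coarse, so its maximum can only land on an integer.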

Related Solutions

Regression – Handling OLS Residuals Not Normally Distributed

The ordinary least squares estimate is still a reasonable estimator in the face of non-normal errors. In particular, the Gauss-Markov Theorem states that the ordinary least squares estimate is the best linear unbiased estimator (BLUE) of the regression coefficients ('best' meaning that it minimizes mean squared error among linear unbiased estimators) as long as the errors

(1) have mean zero

(2) are uncorrelated

(3) have constant variance

Notice there is no condition of normality here (or even any condition that the errors are IID).
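A rough simulation sketch of this point, under assumptions chosen only for illustration: a fixed design of 50 uniform `x` values, a true line $y = 1 + 2x$, and centred exponential errors (skewed, but with mean zero and constant variance). The average of the OLS slope estimates should sit close to the true slope even though the errors are far from normal.

```r
set.seed(1)
n_rep <- 2000                     # number of simulated data sets
n_obs <- 50                       # observations per data set
x     <- runif(n_obs)             # fixed design, reused in every replicate

slopes <- numeric(n_rep)
for (r in seq_len(n_rep)) {
  eps <- rexp(n_obs) - 1          # skewed errors with mean 0 and variance 1
  y   <- 1 + 2 * x + eps
  slopes[r] <- coef(lm(y ~ x))[2] # store the OLS slope estimate
}

print(mean(slopes))               # compare with the true slope, 2
```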

The normality condition comes into play when you're trying to get confidence intervals and/or $p$-values. As @MichaelChernick mentions (+1, btw) you can use robust inference when the errors are non-normal as long as the departure from normality can be handled by the method - for example, (as we discussed in this thread) the Huber $M$-estimator can provide robust inference when the true error distribution is the mixture between normal and a long tailed distribution (which your example looks like) but may not be helpful for other departures from normality. One interesting possibility that Michael alludes to is bootstrapping to obtain confidence intervals for the OLS estimates and seeing how this compares with the Huber-based inference.
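The bootstrap comparison mentioned above could be sketched as follows. This is a case-resampling percentile bootstrap in base R on simulated data; the sample size, the $t_3$ errors and the names `dat`, `boot_slopes`, `ci_boot` and `ci_normal` are illustrative assumptions, not taken from the original thread.

```r
set.seed(42)
n   <- 200
x   <- rnorm(n)
y   <- 1 + 2 * x + rt(n, df = 3)   # long-tailed errors (illustrative choice)
dat <- data.frame(x = x, y = y)

# Resample rows with replacement and refit the regression each time.
n_boot <- 2000
boot_slopes <- replicate(n_boot, {
  idx <- sample.int(n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))[["x"]]
})

ci_boot   <- quantile(boot_slopes, c(0.025, 0.975))  # percentile interval
ci_normal <- confint(lm(y ~ x, data = dat))["x", ]   # normal-theory interval

print(rbind(bootstrap = ci_boot, normal_theory = ci_normal))
```

If the two intervals differ noticeably, that is a hint that the normal-theory interval is being affected by the tail behaviour of the errors.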

Edit: I often hear it said that you can rely on the Central Limit Theorem to take care of non-normal errors - this is not always true (I'm not just talking about counterexamples where the theorem fails). In the real data example the OP refers to, we have a large sample size but can see evidence of a long-tailed error distribution - in situations where you have long tailed errors, you can't necessarily rely on the Central Limit Theorem to give you approximately unbiased inference for realistic finite sample sizes. For example, if the errors follow a $t$-distribution with $2.01$ degrees of freedom (which is not clearly more long-tailed than the errors seen in the OP's data), the coefficient estimates are asymptotically normally distributed, but it takes much longer to "kick in" than it does for other shorter-tailed distributions.

Below, I demonstrate with a crude simulation in R that when $y_{i} = 1 + 2x_{i} + \varepsilon_i$, where $\varepsilon_{i} \sim t_{2.01}$, the sampling distribution of $\hat{\beta}_{1}$ is still quite long tailed even when the sample size is $n=4000$:

set.seed(5678)
B = matrix(0,1000,2)
for(i in 1:1000)
{
    x = rnorm(4000) 
    y = 1 + 2*x + rt(4000,2.01)
    g = lm(y~x)
    B[i,] = coef(g)
}
qqnorm(B[,2])
qqline(B[,2])


Regression – Exploring Non-Normal Distributions That Equate OLS and MLE in Linear Regression

In maximum likelihood estimation, we calculate

$$\hat \beta_{ML}: \sum \frac {\partial \ln f(\epsilon_i)}{\partial \beta} = \mathbf 0 \implies \sum \frac {f'(\epsilon_i)}{f(\epsilon_i)}\mathbf x_i = \mathbf 0$$

the last relation taking into account the linearity structure of the regression equation.

In comparison, the OLS estimator satisfies

$$\sum \epsilon_i\mathbf x_i = \mathbf 0$$

In order to obtain identical algebraic expressions for the slope coefficients we need to have a density for the error term such that

$$\frac {f'(\epsilon_i)}{f(\epsilon_i)} = \pm \;c\epsilon_i \implies f'(\epsilon_i)= \pm \;c\epsilon_if(\epsilon_i)$$

These are differential equations of the form $y' = \pm\; cxy$ that have solutions

$$\int \frac 1 {y}dy = \pm\; c\int x dx\implies \ln y = \pm\;\frac 12 cx^2$$

$$ \implies y = f(\epsilon) = \exp\left \{\pm\;\frac 12 c\epsilon^2\right\}$$

Any function that has this kernel and integrates to unity over an appropriate domain will make the MLE and OLS for the slope coefficients identical. Namely, we are looking for

$$g(x)= A\exp\left \{\pm\;\frac 12 cx^2\right\} : \int_a^b g(x)dx =1$$

Is there such a $g$ that is not the normal density (or the half-normal or the derivative of the error function)?

Certainly. But one more thing one has to consider is the following: if one uses the plus sign in the exponent, and a symmetric support around zero for example, one will get a density that has a unique minimum in the middle, and two local maxima at the boundaries of the support.
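As a check on the derivation, the familiar normal density with variance $\sigma^2$ is the minus-sign case with $c = 1/\sigma^2$ (the symbol $\sigma^2$ is introduced here only for this check):

$$f(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{\epsilon^2}{2\sigma^2}\right\} \implies \frac{f'(\epsilon)}{f(\epsilon)} = -\frac{\epsilon}{\sigma^2}$$

so the ML first-order condition becomes $-\frac{1}{\sigma^2}\sum \epsilon_i\mathbf x_i = \mathbf 0$, which has the same solution as the OLS condition $\sum \epsilon_i\mathbf x_i = \mathbf 0$.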
