Solved – binomial regression model that captures data with fat tails

Tags: fat-tails, heavy-tailed, logistic, r, regression

Specifically, are there any binomial regression models that use a kernel with heavier tails and higher kurtosis than the standard kernels (logistic/probit/cloglog)?

As a function of the linear predictor $\textbf{x}'\mathbf{\hat{\beta}}$, the logistic distribution

  • Underestimates the probability of my data being in the tails of the distribution
  • Underestimates the kurtosis, i.e. how strongly the data cluster in the middle of the distribution

This can be seen from a diagnostic plot of my fit:

[Diagnostic plot: fitted logistic CDF vs. empirical probabilities over the linear predictor]

  • The red line is the logistic CDF, representing a perfect fit
  • The black line represents the empirical probabilities from my dataset (calculated by binning observations into 0.1-wide intervals of $\textbf{x}'\mathbf{\hat{\beta}}$, where $\mathbf{\hat{\beta}}$ is obtained from my fit; a sketch of this binning appears after the list)
  • The grey bars in the background represent the number of observations on which these empirical probabilities are based
  • The grey areas mark where the outermost 10% of the data lie (5% on each side).
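As referenced in the list above, here is a minimal R sketch of how such a binned diagnostic can be produced. It assumes a fitted `glm` object `fit` and a data frame `dat` with a binary response column `y`; these names are placeholders, not objects from the question.

```r
# Bin observations by the fitted linear predictor and compare the empirical
# success rate in each bin with the logistic CDF (a perfect fit).
eta  <- predict(fit, type = "link")              # x'beta_hat for each observation
bins <- cut(eta,
            breaks = seq(floor(min(eta)), ceiling(max(eta)), by = 0.1),
            include.lowest = TRUE)
emp  <- tapply(dat$y, bins, mean)                # empirical P(Y = 1) per bin
mid  <- tapply(eta, bins, mean)                  # average linear predictor per bin

plot(mid, emp, type = "l", xlab = "x'beta", ylab = "P(Y = 1)")
curve(plogis(x), add = TRUE, col = "red")        # logistic CDF, i.e. a perfect fit
```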

Ideally, any solution would use R.

Edit

Why am I talking about CDFs? Our GLM equation is:

$$\mathbb{P}(Y = 1) = \mathbb{E}[Y] = g^{-1}(\textbf{x}'\mathbf{\beta})$$

Where $g$ is the link function.

Further, if $g^{-1}$ is a valid cumulative distribution function (i.e. monotonically increasing from 0 to 1, as is indeed the case for probit, logit and cloglog), then consider a latent (not directly observed) continuous random variable $Y^{*}$ whose CDF is given by $g^{-1}$. Then by definition

$$\mathbb{P}(Y^{*} \leq \textbf{x}'\mathbf{\beta}) = g^{-1}(\textbf{x}'\mathbf{\beta})$$

Equating the two equations above, we see that the probability of $Y=1$ is exactly the CDF of $Y^{*}$ evaluated at the linear predictor:

$$\mathbb{P}(Y = 1) = \mathbb{P}(Y^{*} \leq \textbf{x}'\mathbf{\beta})$$

Hence I talk interchangeably about the expected response $\mathbb{E}[Y]$ and CDF of $Y^{*}$ over linear-predictor ($\textbf{x}'\mathbf{\hat{\beta}}$) space.
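A quick numerical check of this equivalence (not specific to my data; it just confirms that the inverse logit link and the logistic CDF are the same function):

```r
# The inverse of the logit link equals the CDF of the standard logistic distribution.
eta <- seq(-4, 4, by = 0.5)                 # a grid of linear-predictor values
all.equal(exp(eta) / (1 + exp(eta)),        # g^{-1}(eta) for the logit link
          plogis(eta))                      # CDF of the latent logistic Y*
#> [1] TRUE
```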

Best Answer

You can generalise the logistic regression model so that the latent distribution is something other than logistic. Using the t distribution lets you capture relationships where the data are contaminated, meaning observations from the "wrong" class appear unexpectedly far from the decision boundary -- this is the binary equivalent of fat tails in regression with a continuous response. In other words, you model

$$ \begin{align} y_i & = \begin{cases} 1 & \text{if } z_i > 0 \\ 0 & \text{if } z_i < 0 \end{cases}\\ z_i & = X_i\beta + \epsilon_i \end{align}$$

where the latent errors $\epsilon$ are distributed as $$\epsilon_i \sim t_\nu \left(0, \frac{\nu-2}{\nu} \right) $$

with $\nu > 2$ estimated from the data.
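Spelling out the implied GLM (this step is my addition; it reads the second parameter of the t distribution above as the squared scale, chosen so that $\epsilon_i$ has standard deviation 1): by the symmetry of the t distribution and the latent-variable argument from the question,

$$\mathbb{P}(y_i = 1) = \mathbb{P}(\epsilon_i > -X_i\beta) = F_{t_\nu}\!\left(\frac{X_i\beta}{\sqrt{(\nu-2)/\nu}}\right)$$

where $F_{t_\nu}$ is the CDF of the standard t distribution with $\nu$ degrees of freedom (`pt(x, df = nu)` in R).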

This is also called robit regression. In particular, see pp. 124-125 of Gelman & Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models (what a mouthful); the above is eqn 6.15 in the book. You can fit it in R using Stan (via the rstan package; Stan is also developed by Andrew Gelman and his collaborators and goes with the book), and possibly other packages. You could also write code to maximise the log-likelihood directly using optim or nlminb; I haven't tried it, but it shouldn't be hard.
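For the direct-maximisation route, a minimal sketch (my own illustration, not code from the book): `X` is assumed to be a model matrix that includes an intercept column and `y` a 0/1 response vector, both placeholders for your own data.

```r
# Robit regression by maximum likelihood:
# P(y = 1) = F_t( X %*% beta / sqrt((nu - 2) / nu), df = nu ).
robit_nll <- function(par, X, y) {
  k    <- ncol(X)
  beta <- par[1:k]
  nu   <- exp(par[k + 1]) + 2                       # reparameterised so that nu > 2
  eta  <- drop(X %*% beta)
  p    <- pt(eta / sqrt((nu - 2) / nu), df = nu)    # success probabilities
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))   # negative log-likelihood
}

start <- c(coef(glm(y ~ X - 1, family = binomial)), log(5))  # warm start at the logistic fit
fit   <- optim(start, robit_nll, X = X, y = y, method = "BFGS", hessian = TRUE)

beta_hat <- fit$par[seq_len(ncol(X))]
nu_hat   <- exp(fit$par[ncol(X) + 1]) + 2
```

nlminb would work the same way with the same negative log-likelihood; note that the degrees-of-freedom parameter is estimated on the log(nu - 2) scale here.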